ON ADAPTIVE ESTIMATION OF LINEAR FUNCTIONALS
University of Pennsylvania

Adaptive estimation of linear functionals over a collection of pa- 
rameter spaces is considered. A between-class modulus of continuity, 
a geometric quantity, is shown to be instrumental in characterizing 
the degree of adaptability over two parameter spaces in the same way 
that the usual modulus of continuity captures the minimax difficulty 
of estimation over a single parameter space. A general construction of 
optimally adaptive estimators based on an ordered modulus of con- 
tinuity is given. The results are complemented by several illustrative 
examples. 

1. Introduction. Adaptive estimation of linear functionals occupies an 
important position in the theory of nonparametric function estimation. As 
a step toward the goal of adaptive estimation, attention is first focused on the 
more concrete goal of developing a minimax theory over a fixed parameter 
space which can, for example, specify the smoothness of the function. This 
theory is now well developed, particularly in the white noise with drift model 

(1)  $dY(t) = f(t)\,dt + \frac{1}{\sqrt{n}}\,dW(t), \qquad -\frac{1}{2} \le t \le \frac{1}{2},$



where W(t) is a standard Brownian motion. This model arises as an ap- 
proximation to many other nonparametric models such as those of density 
estimation, nonparametric regression and spectral estimation. See, for ex- 
ample, [1, 2, 19, 21]. 
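
As a rough illustration of model (1), the following sketch simulates
discretized increments of Y on a uniform grid. The function name, the grid
size m, the interval [-1/2, 1/2] and the midpoint discretization are
assumptions made for this example only.

```python
import numpy as np

def simulate_white_noise_increments(f, n, m=1000, seed=0):
    """Simulate increments of dY(t) = f(t) dt + n^{-1/2} dW(t) on [-1/2, 1/2].

    Hypothetical helper: the interval is split into m cells of width 1/m and
    f is evaluated at the cell midpoints.
    """
    rng = np.random.default_rng(seed)
    dt = 1.0 / m
    t = -0.5 + dt * (np.arange(m) + 0.5)          # cell midpoints
    drift = f(t) * dt                             # f(t) dt
    noise = rng.normal(0.0, np.sqrt(dt / n), m)   # n^{-1/2} dW has variance dt / n
    return t, drift + noise
```

Summing the increments gives an estimate of the integral of f whose variance
is 1/n, which is the sense in which n plays the role of a sample size.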

Based on white noise data, Ibragimov and Hasminskii [15] constructed lin- 
ear estimators with the smallest maximum mean squared error over convex 
symmetric parameter spaces. Donoho and Liu [9] and Donoho [8] extended 
this theory to general convex parameter spaces in terms of a modulus of 
continuity,

(2)  $\omega(\epsilon, \mathcal{F}) = \sup\{|Tg - Tf| : \|g - f\|_2 \le \epsilon;\ f \in \mathcal{F},\ g \in \mathcal{F}\}.$

Affine estimators play a fundamental role in this theory. For a convex
function class $\mathcal{F}$ and linear functional $T$, set the minimax affine risk
$R_A^*(n, \mathcal{F}) = \inf_{\hat{T}\ \mathrm{affine}} \sup_{f \in \mathcal{F}} E(\hat{T} - Tf)^2$
and the minimax risk
$R_N^*(n, \mathcal{F}) = \inf_{\hat{T}} \sup_{f \in \mathcal{F}} E(\hat{T} - Tf)^2$.
Donoho and Liu [9] and Donoho [8] have shown that

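(3)  $\frac{1}{8}\,\omega^2\big(\frac{1}{\sqrt{n}}, \mathcal{F}\big) \le R_N^*(n, \mathcal{F}) \le R_A^*(n, \mathcal{F}) \le \omega^2\big(\frac{1}{\sqrt{n}}, \mathcal{F}\big)$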

and that the modulus can be used to construct the optimal affine procedure. 

A natural way to extend the minimax theory to an adaptation theory 
is to construct estimators which are simultaneously near minimax over a 
collection of smoothness classes. In general, however, this goal cannot be 
realized. Lepski [17] was the first to give examples which demonstrated that 
rate optimal adaptation over a collection of Lipschitz classes is not possible 
when estimating a function at a point. Efromovich and Low [14] showed that 
this phenomenon is true in general over a collection of nested symmetric sets 
where the minimax rates are algebraic of different orders. See also [16]. 

On the other hand, the goal of fully rate adaptive estimation of linear 
functionals can sometimes be realized. When the minimax rate over each 
parameter space is slower than any algebraic rate, Cai and Low [5] have 
given examples of nested symmetric sets where fully adaptive estimators 
can be constructed. In addition, when the sets are not symmetric, there are 
also examples where rate adaptive estimators can be constructed. Such is 
the case for estimating monotone functions where an estimator can adapt 
over Lipschitz classes. See [20]. Other recent results can be found in [10, 11, 
12, 13, 18]. 

Although the above-mentioned examples show that there are cases where 
fully rate adaptive estimators exist and other cases where fully rate adaptive 
estimators do not exist, to date there is no general theory that characterizes 
exactly when adaptation is possible. The present paper provides a general 
adaptation theory for estimating linear functionals. We develop a geometric 
understanding of the adaptive estimation problem analogous to that given by 
Donoho [8] for minimax theory. This theory describes exactly when fully rate 
adaptive estimators exist, and when they do not exist, the theory provides 
a general construction of estimators with minimum adaptation cost. 

This paper and its companion papers Cai and Low [6, 7] develop a co- 
herent approach to minimax theory, adaptive estimation and the construc- 
tion of adaptive confidence intervals. The theory relies on two geometric 
quantities — a between-class modulus of continuity and an ordered modulus 
of continuity. For a pair of parameter spaces $\mathcal{F}_1$ and $\mathcal{F}_2$, the ordered modulus
of continuity is defined by

(4)  $\omega(\epsilon, \mathcal{F}_1, \mathcal{F}_2) = \sup\{Tg - Tf : \|g - f\|_2 \le \epsilon;\ f \in \mathcal{F}_1,\ g \in \mathcal{F}_2\}.$

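The ordered modulus of continuity is instrumental in the construction of
the adaptive estimators given in Sections 2, 4 and 5. It is a quantity derived
from the geometry of the graph of the linear functional $T$ between $\mathcal{F}_1$ and
$\mathcal{F}_2$. It is also convenient to define a between-class modulus of continuity
$\omega_+(\epsilon, \mathcal{F}_1, \mathcal{F}_2)$ by

(5)  $\omega_+(\epsilon, \mathcal{F}_1, \mathcal{F}_2) = \sup\{|Tg - Tf| : \|g - f\|_2 \le \epsilon;\ f \in \mathcal{F}_1,\ g \in \mathcal{F}_2\}.$

Clearly, $\omega_+(\epsilon, \mathcal{F}_1, \mathcal{F}_2) = \max\{\omega(\epsilon, \mathcal{F}_1, \mathcal{F}_2), \omega(\epsilon, \mathcal{F}_2, \mathcal{F}_1)\}$.
When $\mathcal{F}_1 = \mathcal{F}_2 = \mathcal{F}$, $\omega(\epsilon, \mathcal{F}, \mathcal{F}) = \omega_+(\epsilon, \mathcal{F}, \mathcal{F})$ is the usual
modulus of continuity over $\mathcal{F}$ and will be denoted by $\omega(\epsilon, \mathcal{F})$ as in (2).
We show that the between-class modulus can be used to characterize when
adaptation is possible. This modulus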
captures the degree of adaptability over two parameter spaces in the same 
way that the usual modulus of continuity captures the minimax difficulty of 
estimation over a single parameter space. 
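
To make definitions (4) and (5) concrete, here is a toy brute-force
computation in which each parameter space is a finite set of vectors and T is
a dot product. The finite sets, the Euclidean norm standing in for the L2
norm, and all names are assumptions for illustration; the parameter spaces in
the paper are convex function classes.

```python
import numpy as np

def ordered_modulus(eps, F1, F2, t):
    """sup{ Tg - Tf : ||g - f|| <= eps, f in F1, g in F2 } over finite sets."""
    best = -np.inf
    for f in F1:
        for g in F2:
            if np.linalg.norm(g - f) <= eps:
                best = max(best, float(t @ (g - f)))
    return best

def between_class_modulus(eps, F1, F2, t):
    """sup{ |Tg - Tf| : ||g - f|| <= eps, f in F1, g in F2 }, as in (5)."""
    best = -np.inf
    for f in F1:
        for g in F2:
            if np.linalg.norm(g - f) <= eps:
                best = max(best, abs(float(t @ (g - f))))
    return best

# Toy data (hypothetical): T picks out the first coordinate.
F1 = [np.array([0.0, 0.0]), np.array([1.0, 0.0])]
F2 = [np.array([0.0, 0.0]), np.array([0.0, 2.0]), np.array([1.5, 0.5])]
t = np.array([1.0, 0.0])
```

On this toy data the two ordered moduli differ, and the between-class modulus
equals the larger of the two, as the identity following (5) states.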

We begin in Section 2 with a complete treatment of adaptation over an 
arbitrary pair of convex parameter spaces and any linear functional. In par- 
ticular, we do not assume that the parameter spaces are nested or sym- 
metric. A general construction for an optimally adaptive estimator is given. 
The adaptive estimator is based on appropriate tests between the parameter 
spaces which rely on a general understanding of the possible tradeoffs of bias 
and variance using the ordered moduli of continuity. 

The theory shows that there are three main cases in terms of the cost 
of adaptation. We shall call the first case the regular one where, as in the 
case of estimating a function at a point over Lipschitz classes, the cost of 
adaptation is a logarithmic factor of the noise level. In the second case, 
full adaptation is possible as in the examples considered in [5, 18]. More 
dramatically, in the third case, the cost of adaptation is much greater than 
in the regular case. The cost of adaptation in this case is a power of the 
noise level. Examples of all three cases are given in Section 3. 

Section 2 gives a geometric characterization of adaptation and shows the 
fundamental role played by the between-class and ordered moduli of conti- 
nuity in this theory. The adaptation theory over two spaces in turn provides 
a fundamental building block for adaptation over richer collections of param- 
eter spaces. In Section 4 we extend this theory to any collection of finitely 
many nested convex spaces, and under mild regularity conditions on the 
modulus to finitely many nonnested convex parameter spaces. The focus of 
this section is on the construction of an estimator with minimum adapta- 
tion cost. In Section 5 we further generalize the results to a continuum of 
parameter spaces. 
2. Adaptation over two parameter spaces. In this section we give a complete
development of adaptation over an arbitrary pair of convex parameter
spaces $\mathcal{F}_1$ and $\mathcal{F}_2$ with $\mathcal{F}_1 \cap \mathcal{F}_2 \ne \emptyset$ and any linear functional $T$.
We first derive a benchmark for the performance over $\mathcal{F}_2$ of any minimax
rate optimal estimator over $\mathcal{F}_1$. The benchmark is given in terms of the
between-class modulus of continuity. A general construction for an optimally
adaptive estimator is then given. The adaptive procedure is built on a test
between the parameter spaces which is based on the tradeoffs of bias and
variance using the ordered moduli of continuity. Taken together these results
show that the moduli of continuity capture the degree to which adaptation is
possible.

Throughout the paper, we denote by C a generic constant that may vary 
from place to place. 

2.1. Lower bound on the cost of adaptation. Let the ordered modulus of
continuity $\omega(\epsilon, \mathcal{F}_1, \mathcal{F}_2)$ be defined as in (4) and the between-class modulus be
given as in (5). Note that $\omega(\epsilon, \mathcal{F}_1, \mathcal{F}_2)$ does not necessarily equal $\omega(\epsilon, \mathcal{F}_2, \mathcal{F}_1)$.
It is however clear that the modulus $\omega(\epsilon, \mathcal{F}_1, \mathcal{F}_2)$ is an increasing function
of $\epsilon$. Moreover, if $\mathcal{F}_1$ and $\mathcal{F}_2$ are convex with $\mathcal{F}_1 \cap \mathcal{F}_2 \ne \emptyset$, then for a linear
functional $T$ the modulus $\omega(\epsilon, \mathcal{F}_1, \mathcal{F}_2)$ is also a concave function of $\epsilon$. See [6].
Note also that although $\omega_+$ need not be concave, it follows from the concavity
of the ordered modulus of continuity that for $D > 1$,



The following result gives the lower bound for the maximum risk over $\mathcal{F}_2$
for minimax rate optimal estimators over $\mathcal{F}_1$.

Theorem 1. Let $T$ be a linear functional and let $\mathcal{F}_1$ and $\mathcal{F}_2$ be parameter
spaces with $\mathcal{F}_1 \cap \mathcal{F}_2 \ne \emptyset$ and $\omega(\epsilon, \mathcal{F}_1) \le \omega(\epsilon, \mathcal{F}_2)$ for all sufficiently small
$0 < \epsilon \le \epsilon_0$. Suppose that $\hat{T}$ is an estimator of $Tf$ based on the white noise
data (1) satisfying



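for some constant $c_* > 0$. Let
$\gamma_n = \max\{e,\ \omega_+^2(\frac{1}{\sqrt{n}}, \mathcal{F}_1, \mathcal{F}_2) / \omega^2(\frac{1}{\sqrt{n}}, \mathcal{F}_1)\}$.
Then there exists some fixed constant $c > 0$ such that, for all sufficiently large $n$,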



(6) 






Proof. We shall only consider the case where $\mathcal{F}_1$ and $\mathcal{F}_2$ are closed and
norm bounded. The general case is proved by taking limits of this case as in
Section 14 of [8].
For the case of $\gamma_n = e$, (8) follows directly from (3). Now assume that
$\gamma_n > e$. Then $\sup_{f \in \mathcal{F}_1} E_f(\hat{T} - Tf)^2 \le \frac{c_*^2}{\gamma_n}\,\omega_+^2(\frac{1}{\sqrt{n}}, \mathcal{F}_1, \mathcal{F}_2)$. Choose $f_{1,n} \in \mathcal{F}_1$



and /2, n G T% such that ||/i, n — /2,n||2 < v/ and such that the between- 
class modulus is attained at {fx >n , / 2 ,n} : |T/ 2 , n -r/i )n | = cu + ( \J Xl ^-, T\,T<i). 
It then follows from the constrained risk inequality of Brown and Low [3] 
and equation (6) that 



'ln7 n 
and hence by equation (3) sup /g ^ 2 E f {f-Tf) 2 > ±{uj% (y^,^, T 2 ) +

Theorem 1 considers the performance over $\mathcal{F}_2$ of estimators which are
minimax rate optimal over $\mathcal{F}_1$. This is a particularly important case but
we shall also need a more general bound when we discuss adaptation over 
collections of parameter spaces. The proof of the following theorem is similar 
to that of Theorem 1 and is thus omitted. 

Theorem 2. Consider two function classes $\mathcal{F}_1$ and $\mathcal{F}_2$ with $\mathcal{F}_1 \cap \mathcal{F}_2 \ne \emptyset$.
Let $T$ be a linear functional and suppose that

(9) sup£; / (f-T/) 2 <7- 2 £ 4f-^,^ 1 ,^ 2 ) 

for some j n > 1 . Then for any < p < 1 , 
sup [E f {f-Tf) 2 ] l l 2 

(10) ^ 



2.2. Construction of optimally adaptive procedure. We now turn to a 
general construction of an adaptive procedure for any given linear
functional $T$ over any two convex parameter spaces $\mathcal{F}_1$ and $\mathcal{F}_2$ with nonempty
intersection $\mathcal{F}_1 \cap \mathcal{F}_2 \ne \emptyset$.

Before describing the adaptive procedure, first focus attention on each
parameter space separately. If it were known that $f \in \mathcal{F}_i$, then the theory
of Donoho and Liu yields linear estimators $\hat{T}_i$ which satisfy
$\sup_{f \in \mathcal{F}_i} E(\hat{T}_i - Tf)^2 \le \omega^2(\frac{1}{\sqrt{n}}, \mathcal{F}_i)$. Moreover these estimators are
minimax rate optimal over $\mathcal{F}_i$. The adaptive procedure is then based on a
test between $\mathcal{F}_1$ and $\mathcal{F}_2$. If the test accepts $\mathcal{F}_1$ then the procedure uses
$\hat{T}_1$, whereas if it rejects $\mathcal{F}_1$ the procedure uses a minimax rate optimal
procedure over $\mathcal{F}_1 \cup \mathcal{F}_2$. The test is designed in such a way that if
$f \in \mathcal{F}_1$ it has a small probability of rejecting $\mathcal{F}_1$. On the other hand, if
$f \in \mathcal{F}_2$ and the bias of $\hat{T}_1$ is large, the test has only a small probability
of accepting $\mathcal{F}_1$.

In the case where $\mathcal{F}_1$ and $\mathcal{F}_2$ are nonnested convex parameter spaces it is
clear that an implementation of this approach requires a minimax analysis
for sets which are not convex. The reason is that we need to know the
minimax risk and minimax rate optimal procedure over the union
$\mathcal{G} = \mathcal{F}_1 \cup \mathcal{F}_2$, which is in general nonconvex. Such a theory has been given
in [6], where it was shown that if $\mathcal{G}$ is a union of a finite number of closed
convex parameter spaces, the minimax risk is of the order $\omega^2(\frac{1}{\sqrt{n}}, \mathcal{G})$.
Moreover, explicit rate optimal procedures, say $\hat{T}_2^*$, were constructed which
for $\mathcal{G} = \mathcal{F}_1 \cup \mathcal{F}_2$ satisfy

(11)  $\sup_{f \in \mathcal{G}} E(\hat{T}_2^* - Tf)^2 \le C\,\omega^2\big(\frac{1}{\sqrt{n}}, \mathcal{G}\big).$

In the adaptive procedure $\hat{T}_2^*$ is used whenever $\mathcal{F}_1$ is rejected.

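The test between $\mathcal{F}_1$ and $\mathcal{F}_2$ is based on a comparison of linear estimators
which trade bias and variance over $\mathcal{F}_1$ and $\mathcal{F}_2$ in a precise way. This trading
of bias and variance is based on results in [6] which show how to use the
ordered modulus of continuity to construct a linear procedure which has
upper bounds for the bias over one parameter space and lower bounds for
the bias over the other parameter space. More specifically, for two convex
sets $\mathcal{F}$ and $\mathcal{H}$ with $\mathcal{F} \cap \mathcal{H} \ne \emptyset$, a linear estimator $\hat{T}$ is given which has
variance and bias satisfying

(12)  $\mathrm{Var}(\hat{T}) = E(\hat{T} - E\hat{T})^2 \le V,$

(13)  $\sup_{f \in \mathcal{F}} (E\hat{T} - Tf) \le \frac{1}{2} \sup_{\epsilon > 0} \big(\omega(\epsilon, \mathcal{F}, \mathcal{H}) - \epsilon\sqrt{nV}\big),$

(14)  $\inf_{f \in \mathcal{H}} (E\hat{T} - Tf) \ge -\frac{1}{2} \sup_{\epsilon > 0} \big(\omega(\epsilon, \mathcal{F}, \mathcal{H}) - \epsilon\sqrt{nV}\big).$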

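For a given bound $V$ on the variance this theory leads to two linear
estimators by interchanging the roles of $\mathcal{F}$ and $\mathcal{H}$. In our context we make
two different choices for $V$. For $1 \le i \ne j \le 2$ let

$\gamma_{i,j} = \max\{e,\ \omega^2(\frac{1}{\sqrt{n}}, \mathcal{F}_i, \mathcal{F}_j) / \omega^2(\frac{1}{\sqrt{n}}, \mathcal{F}_1)\}$ and

(15)  $\gamma_+ = \max(\gamma_{1,2}, \gamma_{2,1}) = \max\{e,\ \omega_+^2(\frac{1}{\sqrt{n}}, \mathcal{F}_1, \mathcal{F}_2) / \omega^2(\frac{1}{\sqrt{n}}, \mathcal{F}_1)\}.$

The estimators $\hat{T}_{i,j}$ for $1 \le i \ne j \le 2$ needed in the test are linear estimators
satisfying (12)-(14) for $\mathcal{F} = \mathcal{F}_i$, $\mathcal{H} = \mathcal{F}_j$ and $V = \sigma_{i,j}^2$.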

The test given below relies on an understanding of the bias properties of
$\hat{T}_{1,2}$ and $\hat{T}_{2,1}$. Simple bounds on the bias are easy to obtain from (13) and (14)
since



sup(cj(e, T i: Tj) - eJnah' 

£>0 V ^ 



The test is based on a comparison of the estimator $\hat{T}_1$ and both $\hat{T}_{1,2}$ and
$\hat{T}_{2,1}$. Note that if $f \in \mathcal{F}_1$,

E{f x - fi, 2 ) = E(f x - Tf) - E{f ly2 - Tf) 
(16) , 



n I \ V n 



E(T X - T 2 ,i) = E(T\ - Tf) - E{T 2 ,x - Tf) 

(17) 



( -U-^i 

in 



-^-,J r 2,T\ ) =&2,1) 

n y 



(18) 



(19) 



= "1,2, 



= "2,1- 



Hence, if $f \in \mathcal{F}_1$ it is easy to select a value so that the chance that
$\hat{T}_1 - \hat{T}_{2,1}$ is greater than that value is small. Likewise it is easy to select
another value so that the chance that $\hat{T}_1 - \hat{T}_{1,2}$ is less than that value is
small. A careful selection of these values leads to the following test between
$\mathcal{F}_1$ and $\mathcal{F}_2$:

(20)  $I_n = 1\big(\hat{T}_{1,2} - 5b_{1,2} - 4\omega(\frac{1}{\sqrt{n}}, \mathcal{G}) \le \hat{T}_1 \le \hat{T}_{2,1} + 5b_{2,1} + 4\omega(\frac{1}{\sqrt{n}}, \mathcal{G})\big).$



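The value $I_n = 1$ corresponds to accepting $\mathcal{F}_1$, in which case $\hat{T}_1$ is used.
The value $I_n = 0$ corresponds to rejecting $\mathcal{F}_1$, in which case $\hat{T}_2^*$ is used. The
adaptive estimator can then be written as

(21)  $\hat{T} = I_n \hat{T}_1 + (1 - I_n)\hat{T}_2^*,$

where $\hat{T}_1$ satisfies $\sup_{f \in \mathcal{F}_1} E(\hat{T}_1 - Tf)^2 \le \omega^2(\frac{1}{\sqrt{n}}, \mathcal{F}_1)$ and $\hat{T}_2^*$ satisfies (11).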
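
The test-then-select rule in (20) and (21) can be sketched as follows. This is
a minimal illustration that takes already computed numbers as inputs; the
function name and argument names are hypothetical, and the thresholds follow
our reading of (20), with the first estimator accepted when it lies between
the two shifted comparison estimators.

```python
def adaptive_estimate(t1, t12, t21, t2_star, b12, b21, omega_g):
    """Return (I_n, estimate) for the two-space procedure in (20)-(21).

    Hypothetical interface: t1, t12, t21 and t2_star are realized values of
    the four estimators, b12 and b21 are the bias bounds, and omega_g stands
    for the modulus omega(1/sqrt(n), G) over the union of the two spaces.
    """
    lower = t12 - 5.0 * b12 - 4.0 * omega_g
    upper = t21 + 5.0 * b21 + 4.0 * omega_g
    accept = 1 if lower <= t1 <= upper else 0              # I_n in (20)
    return accept, accept * t1 + (1 - accept) * t2_star    # estimator (21)
```

The estimator falls back to the procedure for the union only when the first
estimator is far from at least one of the two comparison estimators.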

2.3. Adaptivity of the procedure. In the previous subsection an estimator
$\hat{T}$ was constructed based on a test between two parameter spaces. In
this section we show that this estimator is adaptively rate optimal over $\mathcal{F}_1$
and $\mathcal{F}_2$. As a consequence it is shown that the lower bound for adaptation
between $\mathcal{F}_1$ and $\mathcal{F}_2$ as given in Theorem 1 is sharp. The following theorem
summarizes these results.

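Theorem 3. Suppose $\mathcal{F}_1$ and $\mathcal{F}_2$ are two closed convex parameter spaces
with $\mathcal{F}_1 \cap \mathcal{F}_2 \ne \emptyset$ and $\omega(\epsilon, \mathcal{F}_1) \le \omega(\epsilon, \mathcal{F}_2)$. The estimator $\hat{T}$ defined in (21)
satisfies for some fixed $C > 0$

(22)  $\sup_{f \in \mathcal{F}_1} E(\hat{T} - Tf)^2 \le C\,\omega^2\big(\frac{1}{\sqrt{n}}, \mathcal{F}_1\big)$

and

(23)  $\sup_{f \in \mathcal{F}_2} E(\hat{T} - Tf)^2 \le C\Big(\omega_+^2\big(\sqrt{\frac{\ln \gamma_+}{n}}, \mathcal{F}_1, \mathcal{F}_2\big) + \omega^2\big(\frac{1}{\sqrt{n}}, \mathcal{F}_2\big)\Big),$

where $\gamma_+$ is defined in (15).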

In light of the lower and upper bounds given in Theorems 1 and 3 we give 
the following definition. 

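Definition 1. We shall call an estimator $\hat{T}$ optimally adaptive over
$\mathcal{F}_1$ and $\mathcal{F}_2$ if it satisfies both (22) and (23).

Remark. The estimator $\hat{T}$ defined in (21) is also adaptive between $\mathcal{F}_1$
and $\mathcal{G} = \mathcal{F}_1 \cup \mathcal{F}_2$. Note that (23) is equivalent to

(24)  $\sup_{f \in \mathcal{G}} E(\hat{T} - Tf)^2 \le C\Big(\omega_+^2\big(\sqrt{\frac{\ln \gamma_1}{n}}, \mathcal{F}_1, \mathcal{G}\big) + \omega^2\big(\frac{1}{\sqrt{n}}, \mathcal{G}\big)\Big),$

where $\gamma_1 = \max\big(e,\ \omega_+^2(\frac{1}{\sqrt{n}}, \mathcal{F}_1, \mathcal{G}) / \omega^2(\frac{1}{\sqrt{n}}, \mathcal{F}_1)\big)$. Therefore $\hat{T}$ attains the exact minimax
rate of convergence over $\mathcal{F}_1$ and attains the lower bound on adaptation over
$\mathcal{G}$ as given in Theorem 1.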

As mentioned in the previous section, the estimator $\hat{T}$ was constructed
by testing between $\mathcal{F}_1$ and $\mathcal{F}_2$. The proof of Theorem 3 is based on a precise
analysis of the properties of the test as given in the following lemmas.

This test is constructed so that for $f \in \mathcal{F}_1$ the probability of rejecting $\mathcal{F}_1$
is small. Lemma 1 below provides a specific bound on the probability of
rejecting $\mathcal{F}_1$ when $f \in \mathcal{F}_1$.

Lemma 1. If $f \in \mathcal{F}_1$, then



(25) P(I n = 0) < 



w 4 (l/v^,g) 



Proof. First note that for a standard normal random variable Z, P{Z > 
2 

(46 lj2 + 4u;(l/^,a)) 2 ' 



A) < exp(-4r) holds for all A > 0. It then follows from (16)~(20) that 



P(/„ = 0)<exp(- ' 4 ^ + t (1/ ^ g>) ) 
V 2v 1)2 / 

(46 2 ,i+4u;(l/ v ^,g)) 2 



2U2,1 



+ exp 



First note that if w 2 (^, F x ) > ^^u 2 (\f ^^,^1,^2), then since e~ 2x < 
\x~ 2 for x > 0, it follows that 

/ {Ab ia + Aio{i/^i,g)f\ ( n u; 2 (i/^i,g) 

exp ■ < exp ' 



2v 1>2 J- ^\ w 2 (l/V^i) 



(26) 

" 2 uo 4 (i/^,g) 

On the other hand, if w 2 (^,^i) < ^-^{yf^-^x,^), then 
(4& li2 +4w(l/^,e)) 2 ' 



exp 



2r 



1,2 



< 



exp 



16w 2 (Jln 7li2 /n,^ 1 , T 2 ) + 16u; 2 (l/v^,0) 



(27) <exp - 41n 7 i i2 + 2 



(8/ln7i j2 )w 2 (y In 71,2/n, ^1, ^2) 
u; 2 (l/^,S) 



W 2 (l/VH,^i,^2) 






T. T. CAI AND M. G. LOW 



<exp(-(41n 7l , 2 + 2 ^1^9) 



< 



u 4 (l/y/E,ft) 1^(1/^,^,5) lw 4 (l/^,fi) 



uA{i/^i,ft,g) 2 io^i/^g) 2 ^{i/^i,g) ■ 



Hence combining (26) and (27) yields exp(- (4feu+ ^^ |g))2 ) < ^(Y/^r 

A similar argument shows exp(- ^y 8 "' ) < and (25) 

follows. □ 

The test also has a large probability of rejecting $\mathcal{F}_1$ when $f \in \mathcal{F}_2$ and
the bias of $\hat{T}_1$ is large, since in such a case either $E(\hat{T}_1 - \hat{T}_{2,1})$ is large or
$E(\hat{T}_{1,2} - \hat{T}_1)$ is large. The following lemma gives a useful upper bound on
the probability of using $\hat{T}_1$ in this case.

Lemma 2. If $f \in \mathcal{F}_2$ and $|E\hat{T}_1 - Tf| > \lambda\big(b_{1,2} + b_{2,1} + \omega(\frac{1}{\sqrt{n}}, \mathcal{G})\big)$ for some
$\lambda > 6$, then

(28)  $P(I_n = 1) \le e^{-(\lambda - 6)^2/4}.$

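Proof. We shall only give the proof when
$E\hat{T}_1 - Tf > \lambda\big(b_{1,2} + b_{2,1} + \omega(\frac{1}{\sqrt{n}}, \mathcal{G})\big)$, as the case when
$E\hat{T}_1 - Tf < -\lambda\big(b_{1,2} + b_{2,1} + \omega(\frac{1}{\sqrt{n}}, \mathcal{G})\big)$ can be handled similarly.
Let $f \in \mathcal{F}_2$. Then
$P(I_n = 1) \le P\big(\hat{T}_1 - \hat{T}_{2,1} \le 5b_{2,1} + 4\omega(\frac{1}{\sqrt{n}}, \mathcal{G})\big)$. Note that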



£ (fx - f 2 ,i - 562,1 - 4w f-^=, 



£?(2\ - Tf) - E(T 2il -Tf) - 56 2) i - 4cu( 



71 



Now Var(Ti - f 2jl ) < W2 ,i = 2(u; 2 (^,^) + ^-^^.^ft)) yields 



(A - 6) 2 (^(^72,1/12, ^1) +^(l/v^^)) 2 

P(I n = l)<exp' 



< exp 



2 U2,i 
(A-6) 2 ^ 
□

The proof of Theorem 3 now follows from Lemma 1 for (22) and from 
Lemma 2 for (23). 

Proof of Theorem 3. The minimax rate optimality of $\hat{T}$ over $\mathcal{F}_1$
follows directly from Lemma 1 and the fact that $\hat{T}_2^*$ satisfies (11):

sup E(f - Tf) 2 < sup E(Ti - Tf) 2 



■ su P (E|r 2 * -r/i 4 ) 1 / 2 • (p(i n = o)) 1 / 2 

.fe.fi 

1 nr\ , ^ .l( 1 „W(VvVl) 



\Jn' ) Xy/n' J LU 2 (l/y/n,G) 



( A=,?i 

In 



and thus (22) holds. The proof of (23) is broken into two parts. If $f \in \mathcal{F}_2$
and $|E\hat{T}_1 - Tf| \le 6\big(b_{1,2} + b_{2,1} + \omega(\frac{1}{\sqrt{n}}, \mathcal{G})\big)$, then

$E(\hat{T} - Tf)^2 \le E(\hat{T}_1 - Tf)^2 + E(\hat{T}_2^* - Tf)^2$



(29) 



n 



n I \ v n 



+ Cuj z [ —,g 

in 



where C is a constant not depending on /, and hence in this case (23) holds. 

Now note that if $X$ has a normal distribution with mean $\mu$ and variance
$\sigma^2$, then

(30)  $(EX^4)^{1/2} \le 3(\mu^2 + \sigma^2).$

Hence if $f \in \mathcal{F}_2$ and $|E\hat{T}_1 - Tf| > \lambda\big(b_{1,2} + b_{2,1} + \omega(\frac{1}{\sqrt{n}}, \mathcal{G})\big)$ for some $\lambda > 6$,
it then follows from Lemma 2 and inequality (30) that

$E(\hat{T} - Tf)^2 \le (E|\hat{T}_1 - Tf|^4)^{1/2} \cdot (P(I_n = 1))^{1/2} + E(\hat{T}_2^* - Tf)^2$
-(A-6) 2 /8



< (3Var(Ti) + 3A^& li2 + & 2il +u^^=,g) ) ) e 



(31) 



v 4 ' 

where the constant C does not depend on /, and so (23) also holds in this 
case and the theorem follows. □ 

3. Examples. Section 2 develops the general theory of optimally adaptive 
estimation over two convex parameter spaces. The results can be usefully 
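explained in an alternative way. Let $\mathcal{F}_1$ and $\mathcal{F}_2$ be two convex parameter
spaces with nonempty intersection and $\omega(\epsilon, \mathcal{F}_1) \le \omega(\epsilon, \mathcal{F}_2)$ for $0 < \epsilon \le \epsilon_0$.
Let $\mathcal{T}_{n,c}(\mathcal{F}_1)$ be the collection of estimators which satisfy

$\mathcal{T}_{n,c}(\mathcal{F}_1) = \big\{\hat{T} : \sup_{f \in \mathcal{F}_1} E_f(\hat{T} - Tf)^2 \le c^2\,\omega^2\big(\frac{1}{\sqrt{n}}, \mathcal{F}_1\big)\big\}$

and let

(32)  $R_{n,c}(\mathcal{F}_1, \mathcal{F}_2) = \inf_{\hat{T} \in \mathcal{T}_{n,c}(\mathcal{F}_1)} \sup_{f \in \mathcal{F}_2} E_f(\hat{T} - Tf)^2.$

The quantity $R_{n,c}(\mathcal{F}_1, \mathcal{F}_2)$ gives the optimal performance over $\mathcal{F}_2$ for
minimax rate optimal estimators over $\mathcal{F}_1$. Theorems 1 and 3 taken together
quantify $R_{n,c}(\mathcal{F}_1, \mathcal{F}_2)$ in terms of the between-class modulus of continuity
as

(33)  $R_{n,c}(\mathcal{F}_1, \mathcal{F}_2) \asymp \omega_+^2\big(\sqrt{\frac{\ln \gamma_+}{n}}, \mathcal{F}_1, \mathcal{F}_2\big) + \omega^2\big(\frac{1}{\sqrt{n}}, \mathcal{F}_2\big),$

where $\gamma_+$ is as in (15) and $a_n \asymp b_n$ means that $a_n/b_n$ is bounded away from
$0$ and $\infty$ as $n \to \infty$.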

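In most common cases when estimating a linear functional over convex
parameter spaces the moduli are Hölderian,

(34)  $\omega_+(\epsilon, \mathcal{F}_i, \mathcal{F}_j) = C_{i,j}\,\epsilon^{q(\mathcal{F}_i, \mathcal{F}_j)}(1 + o(1)),$

where we shall write $q(\mathcal{F}_i)$ for $q(\mathcal{F}_i, \mathcal{F}_i)$. In such cases especially clear and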
precise statements can be made which are direct consequences of Theorems 
1 and 3. 
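
For Hölderian moduli the right-hand side of (33) can be evaluated
numerically. The sketch below keeps only the leading terms of the form
C times a power of epsilon, reads the quantity under the square root in (33) as
ln(gamma)/n with gamma the maximum of e and the ratio of the squared
between-class modulus to the squared modulus over the first space (both at
1/sqrt(n)), and uses hypothetical constants c1, c2, c12. It illustrates orders
of magnitude only and is not a formula taken from the paper.

```python
import math

def adaptation_bound(n, q1, q2, q12, c1=1.0, c2=1.0, c12=1.0):
    """Leading-order value of the two terms in (33) for Hoelderian moduli.

    q1, q2 are the exponents of the moduli over the two spaces and q12 is the
    exponent of the between-class modulus; c1, c2, c12 are assumed constants.
    """
    w1_sq = (c1 * n ** (-q1 / 2.0)) ** 2        # omega^2(1/sqrt(n), F1)
    wp_sq = (c12 * n ** (-q12 / 2.0)) ** 2      # omega_+^2(1/sqrt(n), F1, F2)
    gamma = max(math.e, wp_sq / w1_sq)
    eps = math.sqrt(math.log(gamma) / n)        # inflated noise level
    return (c12 * eps ** q12) ** 2 + (c2 * n ** (-q2 / 2.0)) ** 2
```

When all three exponents coincide, gamma equals e and the bound is of the
order of the minimax risk; when q12 is smaller than q1, the logarithm of gamma
grows like log n and the first term carries the extra factor.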
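Corollary 1. Let $\mathcal{F}_1$ and $\mathcal{F}_2$ be convex parameter spaces with $\mathcal{F}_1 \cap \mathcal{F}_2 \ne \emptyset$
and let $T$ be a linear functional. Suppose $\omega_+(\epsilon, \mathcal{F}_i, \mathcal{F}_j)$ are Hölderian with
exponent $q(\mathcal{F}_i, \mathcal{F}_j)$ for $i, j = 1, 2$. If $q(\mathcal{F}_1, \mathcal{F}_2) = q(\mathcal{F}_2) < q(\mathcal{F}_1)$ or
$q(\mathcal{F}_1, \mathcal{F}_2) < q(\mathcal{F}_2) \le q(\mathcal{F}_1)$, then

(35)  $C_1 \big(\frac{\log n}{n}\big)^{q(\mathcal{F}_1, \mathcal{F}_2)} \le R_{n,c}(\mathcal{F}_1, \mathcal{F}_2) \le C_2 \big(\frac{\log n}{n}\big)^{q(\mathcal{F}_1, \mathcal{F}_2)},$

where $0 < C_1 \le C_2$ are constants and $R_{n,c}(\mathcal{F}_1, \mathcal{F}_2)$ is defined above in (32).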

Corollary 1 can then be used to classify the problem of adaptation over 
convex parameter spaces into three cases: 

• Case 1. $q(\mathcal{F}_1, \mathcal{F}_2) = \min(q(\mathcal{F}_1), q(\mathcal{F}_2)) < \max(q(\mathcal{F}_1), q(\mathcal{F}_2))$. This is the
"regular case" which holds for many linear functionals and common function
classes of interest. In this case, one must lose a logarithmic factor as
the minimum cost for adaptation. A common example of such a case is
estimating a function or a derivative at a point, that is, $Tf = f^{(s)}(t_0)$ for some
$s \ge 0$ when the parameter spaces are assumed to be Lipschitz. See Example 2
below and [3, 14, 17].

Besides the regular case, there are two extreme cases. 



• Case 2. $q(\mathcal{F}_1, \mathcal{F}_2) > \min(q(\mathcal{F}_1), q(\mathcal{F}_2))$ or $q(\mathcal{F}_1, \mathcal{F}_2) = q(\mathcal{F}_1) = q(\mathcal{F}_2)$.
This is a case which is not covered in Corollary 1. Results given in
Section 2 show that in this case adaptation for free is always possible. That is,
one can attain the optimal rate of convergence over $\mathcal{F}_1$ and $\mathcal{F}_2$
simultaneously. An example of this case is estimating a function at a point over two
monotone Lipschitz classes. See Examples 1 and 3 below and [20].

• Case 3. $q(\mathcal{F}_1, \mathcal{F}_2) < \min(q(\mathcal{F}_1), q(\mathcal{F}_2))$. In this case the cost of
adaptation is significant, much more than the usual logarithmic penalty in the
regular case. If $f$ is known to be in $\mathcal{F}_1$, one can attain the rate of $n^{-q(\mathcal{F}_1)}$;
and if one knows that $f$ is in $\mathcal{F}_2$, the rate of convergence $n^{-q(\mathcal{F}_2)}$ can be
achieved. Without the information, however, one can only achieve the rate
of $(n/\log n)^{-q(\mathcal{F}_1, \mathcal{F}_2)}$ at best. So the cost of adaptation is a power of $n$ rather
than the logarithmic factor as in the regular case. See Example 2 below.

Note that if the parameter spaces $\mathcal{F}_1$ and $\mathcal{F}_2$ are nested, then only
Cases 1 and 2 are possible and Case 3 does not arise.

We now consider a few examples below to illustrate the three different 
cases. Examples 1 and 3 cover Case 2 in which full adaptation is possible. 
Example 2 covers both Case 1 and Case 3 with different choices of pa- 
rameters. In each of these examples we need to calculate the between-class 
modulus of continuity. The basic idea behind these calculations is contained 
in [9] and consists of finding extremal functions. See [4] for the details of 
these calculations. 
Example 1. In this example we shall have $0 < q(\mathcal{F}_2) < q(\mathcal{F}_1, \mathcal{F}_2) = q(\mathcal{F}_1) < 1$
and $\omega(\epsilon, \mathcal{F}_1, \mathcal{F}_2) = \omega(\epsilon, \mathcal{F}_2, \mathcal{F}_1)$. In this case full mean squared
error adaptation is possible.

For $0 < \alpha \le 1$, let

(36)  $F(\alpha, M) = \{f : [-\frac{1}{2}, \frac{1}{2}] \to \mathbb{R} : |f(x) - f(y)| \le M|x - y|^{\alpha}\}.$

Let $D$ be the set of all decreasing functions and let $F_D(\alpha, M) = F(\alpha, M) \cap D$
be the set of decreasing functions which are also members of $F(\alpha, M)$.
Let $Tf = f(0)$ and assume that $0 < \alpha_2 < \alpha_1 \le 1$. Let $\mathcal{F}_1 = F_D(\alpha_1, M_1)$ and
$\mathcal{F}_2 = F_D(\alpha_2, M_2)$. Then for these parameter spaces and the linear functional
$Tf = f(0)$ it follows from calculations in [4] that, as $\epsilon \to 0$,

u 2 {e, Tx, T 2 ) = u?{e,T 2 ,T\) 

(37) 

= ( 2ai + i)«i/(2«i+i) Ml 1/(2Ql+1) e 2 ^/^ +1 ) (l + o(l)), 

(38) u 2 (e, ^i) = (ai + 1)^/( 2 ^+ 1 )m i 1/(2qi+1) £ 2qi /( 2qi+1 )(1 + o(l)), 

(39) a; 2 (e,^ 2 ) = (a 2 + i)^/(2a2+i) M i/(2a 2 +i) e 2a 2 /(2a 2 +i) (1 + o(1)) 

In this case $q(\mathcal{F}_1, \mathcal{F}_2) = \max(q(\mathcal{F}_1), q(\mathcal{F}_2)) > \min(q(\mathcal{F}_1), q(\mathcal{F}_2))$ and hence
adaptation for free can be achieved.

Example 2. This example shows that sometimes we must lose more
than a logarithmic factor when we try to adapt. Let

$F_R(\alpha, M) = \{f : [-\frac{1}{2}, \frac{1}{2}] \to \mathbb{R} : |f^{(s)}(x) - f^{(s)}(y)| \le M|x - y|^{\alpha - s},\ 0 \le x \le y \le \frac{1}{2}\},$

where $s$ is the largest integer less than $\alpha$. Similarly, let

$F_L(\alpha, M) = \{f : [-\frac{1}{2}, \frac{1}{2}] \to \mathbb{R} : |f^{(s)}(x) - f^{(s)}(y)| \le M|x - y|^{\alpha - s},\ -\frac{1}{2} \le x \le y \le 0\}.$

Finally let $F(\alpha_1, M_1, \alpha_2, M_2) = F_L(\alpha_1, M_1) \cap F_R(\alpha_2, M_2)$.

Note that for the linear functional $Tf = f(0)$ and the (ordered) parameter
spaces $\mathcal{F}_1 = F(\alpha_1, M_1, \alpha_2, M_2)$ and $\mathcal{F}_2 = F(\beta_1, N_1, \beta_2, N_2)$ it follows from the
calculations given in [4] that

(40) u 2 (e,T 1 ) = C(a 1 ,Ad 1 ,a 2 ,M 2 )e 25 ^ 25+1 Hl + o(l)), 

(41) u 2 (e, T 2 ) = CVh^faNzyKW) (1 + o(l)), 

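where $\delta = \max(\alpha_1, \alpha_2)$ and $\rho = \max(\beta_1, \beta_2)$.

Now let $0 < \alpha_2 < \alpha_1 \le 1$ and $0 < \beta_1 < \beta_2 \le 1$. Then
$q(\mathcal{F}_1) = \frac{2\alpha_1}{2\alpha_1 + 1}$ and $q(\mathcal{F}_2) = \frac{2\beta_2}{2\beta_2 + 1}$.
The between-class modulus satisfies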

(42) u; 2 (e,^ 1 ,^ 2 )=C(M 1 , ai ,M 2 ,a 2 ,N 1 ,(3 1 ,N 2 ,p 2 )e 2 ^ 2 "' +1 \l + o(l)), 
where $\gamma = \max(\min(\alpha_1, \beta_1), \min(\alpha_2, \beta_2))$.

Two interesting cases arise, depending on the relationship among
$\alpha_1$, $\alpha_2$, $\beta_1$ and $\beta_2$.

• $\beta_2 > \beta_1 > \alpha_1 > \alpha_2$. Then the quantity $\gamma$ in (42) is $\gamma = \alpha_1$ and so
$q(\mathcal{F}_1, \mathcal{F}_2) = \frac{2\alpha_1}{2\alpha_1 + 1}$. Hence in this case

$q(\mathcal{F}_1, \mathcal{F}_2) = \min(q(\mathcal{F}_1), q(\mathcal{F}_2)) < \max(q(\mathcal{F}_1), q(\mathcal{F}_2)).$

This is a case where a logarithmic penalty term must be paid for
adaptation.

• $\alpha_1 > \beta_2 > \beta_1 > \alpha_2$. In this case, the quantity $\gamma$ in (42) is $\gamma = \beta_1$ and hence
$q(\mathcal{F}_1, \mathcal{F}_2) = \frac{2\beta_1}{2\beta_1 + 1}$. Therefore in this case

$q(\mathcal{F}_1, \mathcal{F}_2) < \min(q(\mathcal{F}_1), q(\mathcal{F}_2)).$

Consequently the cost of adaptation between $\mathcal{F}_1$ and $\mathcal{F}_2$ is much more than
a logarithmic penalty. The maximum risk over the two spaces is of the
order $n^{-2\beta_1/(2\beta_1 + 1)}$.

A particularly interesting case is when $\alpha_1 = \beta_2 > \beta_1 > \alpha_2$. In this case,
the minimax rates of convergence over $\mathcal{F}_1$ and $\mathcal{F}_2$ are the same, both
equal to $n^{-2\beta_2/(2\beta_2 + 1)}$. Yet it is impossible to achieve this optimal rate
adaptively over the two parameter spaces; in fact the cost of adaptation in
this case is substantial.
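
The three-case classification and the exponents of Example 2 can be checked
mechanically. The sketch below is only an illustration: the function names,
the string labels and the floating-point tolerance are assumptions, and the
exponent form 2a/(2a+1) is the one appearing in this example.

```python
def classify_adaptation(q1, q2, q12, tol=1e-12):
    """Classify the cost of adaptation from the Hoelder exponents.

    q1 and q2 are the exponents over the two spaces, q12 the exponent of the
    between-class modulus. The tolerance only guards float comparisons.
    """
    lo, hi = min(q1, q2), max(q1, q2)
    if q12 < lo - tol:
        return "case 3: power cost"
    if abs(q12 - lo) <= tol and lo < hi - tol:
        return "case 1: logarithmic cost"
    return "case 2: adaptation for free"

def q_lipschitz(a):
    """Exponent 2a/(2a+1) associated with smoothness a."""
    return 2.0 * a / (2.0 * a + 1.0)

def example2_exponents(a1, a2, b1, b2):
    """Exponents for the two-sided classes of Example 2.

    a1, a2 are the left and right smoothness of the first space; b1, b2 are
    those of the second space.
    """
    delta = max(a1, a2)
    rho = max(b1, b2)
    gamma = max(min(a1, b1), min(a2, b2))
    return q_lipschitz(delta), q_lipschitz(rho), q_lipschitz(gamma)
```

For instance, smoothness values ordered as in the first bullet give the
regular case, while the ordering of the second bullet gives the power-cost
case.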

Example 3. This will give an example where $0 < q(\mathcal{F}_1) < q(\mathcal{F}_1, \mathcal{F}_2) < q(\mathcal{F}_2) < 1$.
It will also yield an example where $\omega(\epsilon, \mathcal{F}_1, \mathcal{F}_2) \ne \omega(\epsilon, \mathcal{F}_2, \mathcal{F}_1)$. In
this case full mean squared error adaptation can be achieved. Let $Tf = f(0)$.
Now let

$F_D(\alpha_1, M_1, \alpha_2, M_2) = F(\alpha_1, M_1, \alpha_2, M_2) \cap D,$

where $F(\alpha_1, M_1, \alpha_2, M_2)$ is defined as in Example 2.

Let $\beta_1 > \beta_2 > \alpha_1 > \alpha_2$. Calculations in [4] yield for the (ordered) parameter
spaces $\mathcal{F}_1 = F_D(\alpha_1, M_1, \alpha_2, M_2)$ and $\mathcal{F}_2 = F_D(\beta_1, N_1, \beta_2, N_2)$,

w 2 (e,^i) =Ce 2Ql/(2ai+1) (l + o(l)), 

(43) 

w 2 (e ^ 2)=Ce 2/3i/(2 ft+ i) (1 + o(1))) 
u\e,T l ,T2)=CeW^ l Xl + o{l)), 

^ W 2 ( £ ,^2,^i)=C £ 2 ' 3l /( 2 ' 3l+1 )(l + (l)). 

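Hence this is an example where $\omega(\epsilon, \mathcal{F}_1, \mathcal{F}_2) \ne \omega(\epsilon, \mathcal{F}_2, \mathcal{F}_1)(1 + o(1))$. Note
that $\beta_1 > \beta_2$. It then follows from (44) that $q(\mathcal{F}_1, \mathcal{F}_2) = \frac{2\beta_2}{2\beta_2 + 1}$. Hence this is
an example where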

$0 < q(\mathcal{F}_1) < q(\mathcal{F}_1, \mathcal{F}_2) < q(\mathcal{F}_2) < 1.$
In particular, $q(\mathcal{F}_1, \mathcal{F}_2) > \min(q(\mathcal{F}_1), q(\mathcal{F}_2))$, so it is also an example where
full mean squared error adaptation is possible.

4. Adaptation over many parameter spaces. Section 2 gives a complete 
treatment of adaptation over two convex parameter spaces. It is shown that 
the between-class modulus determines the cost of adaptation and the ordered 
modulus can be used for the construction of optimally adaptive procedures. 
This theory of adaptation over two parameter spaces is in turn a fundamental 
building block for adaptation over richer collections of parameter spaces. 
We first extend the theory to adaptation over collections of finitely many 
parameter spaces. Section 5 further generalizes the theory to collections of 
infinitely many parameter spaces. 

The basic idea for the construction of adaptive estimators builds on that 
given for two parameter spaces. In particular the adaptive estimator is based 
on the construction of tests between pairs of parameter spaces. The resulting 
estimator is optimally adaptive in the sense defined in Section 2: it attains 
the lower bound on the cost of adaptation over finitely many convex param- 
eter spaces which satisfy certain regularity conditions on the moduli. We 
shall begin by assuming that the parameter spaces are nested, in which case 
these conditions are always satisfied. 

4.1. Adaptation over nested parameter spaces. Let $\mathcal{F}_1 \subset \mathcal{F}_2 \subset \cdots \subset \mathcal{F}_k$
be closed convex parameter spaces and for convenience set $\mathcal{F}_0 = \emptyset$. In this
context the goal of adaptation is most easily described sequentially. First,
the estimator should attain the exact minimax rate of convergence over
$\mathcal{F}_1$. Given the performance over $\mathcal{F}_1$, the estimator should attain the lower
bound as given in Theorem 1 over $\mathcal{F}_2$. Moreover, for $i \ge 3$ the estimator
should attain the lower bound given its performance over $\mathcal{F}_1, \ldots, \mathcal{F}_{i-1}$.

We shall introduce some notation before explaining the lower bounds in
detail. For $i \ne j$, define the quantity $\gamma_{i,j}$ as follows. If $i \wedge j = \min(i, j) = 1$,



let 



(45) 




and 




If $i \wedge j \ge 2$, define $\gamma_{i,j}$ and $\gamma_{i,j,+}$ recursively by




(46) 



x 



( 



max 

l<m<iAj — 1 




hi 7m, iAj, + 
-In



and 

2 



7»j,+ = DMx(7ij.7i,t) 

tx f e, u; 2 ^^7=' ' -^i J 



(47) x ( max. L» ( f y 



Let $A_i(n) > 0$ be defined by $A_1^2(n) = \omega^2(\frac{1}{\sqrt{n}}, \mathcal{F}_1)$ and, for $2 \le i \le k$,



(48) 4 2 W= maxl, 2 ( J^^t^^M + w 2 f -1= >^ 



l<m<i-l I ' V V n J ) VV'' 7 ' 

Suppose that $c_i > 0$ are some constants for $i = 1, \ldots, k$. If $\hat{T}$ is an estimator
satisfying

(49)  $\sup_{f \in \mathcal{F}_1} E(\hat{T} - Tf)^2 \le c_1 A_1^2(n),$

then Theorem 1 shows that the estimator $\hat{T}$ must satisfy a lower bound over
$\mathcal{F}_2$,

(50)  $\sup_{f \in \mathcal{F}_2} E(\hat{T} - Tf)^2 \ge d_2 A_2^2(n),$

where $d_2 > 0$ is a constant. More generally for $2 \le j \le k$, if an estimator $\hat{T}$
satisfies

(51)  $\sup_{f \in \mathcal{F}_i} E(\hat{T} - Tf)^2 \le c_i A_i^2(n)$ for $i = 1, \ldots, j - 1$,

then Theorem 2 shows that the estimator $\hat{T}$ must satisfy a lower bound over
$\mathcal{F}_j$,

(52)  $\sup_{f \in \mathcal{F}_j} E(\hat{T} - Tf)^2 \ge d_j A_j^2(n)$

for some constant $d_j > 0$. It is thus natural to seek an estimator which
attains (51) for all $1 \le i \le k$ for some constants $c_i > 0$. In light of the lower
bound (52), such an estimator can also be termed optimally adaptive.
We now turn to the construction of such adaptive estimators. As in
Section 2.2, let $\hat{T}_i$ be linear estimators satisfying
$\sup_{f \in \mathcal{F}_i} E(\hat{T}_i - Tf)^2 \le \omega^2(\frac{1}{\sqrt{n}}, \mathcal{F}_i)$. The procedure, which will be defined
precisely below, can be described sequentially as follows.

1. Test between $\mathcal{F}_1$ and $\mathcal{F}_i$ for all $2 \le i \le k$.

2. If all the tests are in favor of $\mathcal{F}_1$, use $\hat{T}_1$ as the estimate of $Tf$.

3. Otherwise, delete $\mathcal{F}_1$ and repeat steps 1 and 2.

The performance of this procedure depends critically on the properties of 
the tests between pairs of parameter spaces. The tests are developed in a 
similar but somewhat more involved way than those in Section 2. 

For $i \ne j$ let $\hat{T}_{i,j}$ be the estimator satisfying (12)-(14) with $\mathcal{F} = \mathcal{F}_i$, $\mathcal{H} = \mathcal{F}_j$

and V = of j, where afj = oj 2 {\J ln ~^' J ,T{, Tj). Then note that as in 
Section 2.2, if / G 



(53) E{f i -f itj )>-u3(^=,^ -J J^M,Ti,F 3 \ =-b itj , 



n I \ V n 



(54) E{fi-t^<^(A=,J\\ =bj,i, 




,2/ 1 T \ 1 2 / ln 7i,i 



(55) Var(Tj - T M ) < 2 ( ^ ( — , ^ j + Fj 



1 , 1 ^.2/./ ln 7ij 



(56) Var(T 4 - T hl ) < 2( <S[ j + —<S U-^,^ 



For $i < j$ the test between $\mathcal{F}_i$ and $\mathcal{F}_j$ is given by

(57) $I_{i,j} = 1\bigl(\hat T_{i,j} - (4(2k)^{1/2} + 1)\, b_{i,j} - 4k^{1/2} A_j(n) \le \hat T_i \le \hat T_{j,i} + (4(2k)^{1/2} + 1)\, b_{j,i} + 4k^{1/2} A_j(n)\bigr)$.

The test is in favor of $\mathcal{F}_i$ if $I_{i,j} = 1$. Our adaptive estimation procedure is
defined in terms of the tests $I_{i,j}$ and the minimax rate optimal estimator
$\hat T_i$ over $\mathcal{F}_i$. The procedure is defined sequentially from "inside-out." It first
tests if $f \in \mathcal{F}_1$ by checking whether $\prod_{j \ge 2} I_{1,j} = 1$, which means that all the
tests $I_{1,j}$ are in favor of $\mathcal{F}_1$. In this case $\hat T_1$ is used. Otherwise $\mathcal{F}_1$ is deleted
and the procedure iterates. More formally, the estimator $\hat T^*$ can be written
as

(58) $\hat T^* = \sum_{i=1}^{k} \Bigl( \prod_{m<i} \Bigl(1 - \prod_{j>m} I_{m,j}\Bigr) \Bigr) \Bigl( \prod_{j>i} I_{i,j} \Bigr) \hat T_i$.
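
The sequential description and the product form (58) select the same estimator. The following Python sketch illustrates this on abstract test outcomes; it is only an illustration under stated assumptions. The names `select_index`, `weights_from_formula` and `check_equivalence` are hypothetical, and the dictionary `tests` simply stores assumed 0/1 values of the tests $I_{i,j}$ for $i < j$; it does not compute the tests themselves.

```python
from itertools import product


def select_index(tests, k):
    """Sequential 'inside-out' rule: return the first index i (1-based) whose
    tests against every larger index all pass; fall back to k otherwise.
    tests[(i, j)] is an assumed 0/1 outcome of the test I_{i,j}, i < j."""
    for i in range(1, k):
        if all(tests[(i, j)] == 1 for j in range(i + 1, k + 1)):
            return i
    return k


def weights_from_formula(tests, k):
    """Weight of each T_i in the product form:
    prod_{m<i} (1 - prod_{j>m} I_{m,j}) * prod_{j>i} I_{i,j}."""
    def passed(m):
        # I_m = prod_{j>m} I_{m,j}; the empty product (m = k) equals 1.
        out = 1
        for j in range(m + 1, k + 1):
            out *= tests[(m, j)]
        return out

    weights = []
    for i in range(1, k + 1):
        w = passed(i)
        for m in range(1, i):
            w *= 1 - passed(m)
        weights.append(w)
    return weights


def check_equivalence(k):
    """Enumerate all 0/1 test outcomes and compare the two descriptions."""
    pairs = [(i, j) for i in range(1, k + 1) for j in range(i + 1, k + 1)]
    for outcome in product([0, 1], repeat=len(pairs)):
        tests = dict(zip(pairs, outcome))
        w = weights_from_formula(tests, k)
        # Exactly one weight is 1, and it sits at the sequentially chosen index.
        if sum(w) != 1 or w[select_index(tests, k) - 1] != 1:
            return False
    return True
```

For every configuration of test outcomes exactly one weight equals 1, so the sum in (58) always reduces to a single estimator $\hat T_i$.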
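The following theorem shows that this procedure is optimally adaptive in
the sense that it attains (51) for all $1 \le i \le k$.

Theorem 4. Let $\mathcal{F}_1 \subset \mathcal{F}_2 \subset \cdots \subset \mathcal{F}_k$ be closed convex parameter spaces.
Then the estimator $\hat T^*$ defined in (58) is optimally adaptive over $\mathcal{F}_i$ for
$i = 1, \ldots, k$. More specifically,

(59) $\sup_{f \in \mathcal{F}_1} E(\hat T^* - Tf)^2 \le C \omega^2\Bigl(\tfrac{1}{\sqrt{n}}, \mathcal{F}_1\Bigr)$

and for $2 \le i \le k$,

(60) $\sup_{f \in \mathcal{F}_i} E(\hat T^* - Tf)^2 \le C \Bigl( \max_{1 \le m \le i-1} \Bigl\{ \omega_+^2\Bigl(\sqrt{\tfrac{\ln \gamma_{m,i,+}}{n}}, \mathcal{F}_m, \mathcal{F}_i\Bigr) \Bigr\} + \omega^2\Bigl(\tfrac{1}{\sqrt{n}}, \mathcal{F}_i\Bigr) \Bigr)$.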

The basic ideas for the proof of Theorem 4 are similar to those of Theorem
3, but the calculations involved are more complicated. There are two main
concerns which need to be addressed. For $f \in \mathcal{F}_i \setminus \mathcal{F}_{i-1}$ one concern is that
the test stops too late and uses $\hat T_j$ for some $j > i$. Lemma 3 below shows
that this probability is small. The other concern is that the test stops early
and uses $\hat T_j$ for $j < i$. This is only a problem when the bias of $\hat T_j$ is large.
We shall show that if that is indeed the case, then the chance of using such
a $\hat T_j$ is small. The specific bound is given in Lemma 4.

Lemma 3. If $f \in \mathcal{F}_i$, then for $j > i$,

(61) $P(\hat T^* = \hat T_j) \le k\, \frac{A_i^4(n)}{A_j^4(n)}$,

where $A_i(n)$ is defined as in (48). In particular, (61) holds for $f \in \mathcal{F}_i \setminus \mathcal{F}_{i-1}$.

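PROOF. It follows from (53)-(56) that for $f \in \mathcal{F}_i$ and $i < m \le k$

$P(I_{i,m} = 0) \le P\bigl(\hat T_i - \hat T_{i,m} < -(4(2k)^{1/2} + 1)\, b_{i,m} - 4k^{1/2} A_m(n)\bigr)$
$\qquad + P\bigl(\hat T_i - \hat T_{m,i} > (4(2k)^{1/2} + 1)\, b_{m,i} + 4k^{1/2} A_m(n)\bigr)$
$\qquad \le \exp\Bigl(-\frac{(4(2k)^{1/2} b_{i,m} + 4k^{1/2} A_m(n))^2}{2 v_{i,m}}\Bigr) + \exp\Bigl(-\frac{(4(2k)^{1/2} b_{m,i} + 4k^{1/2} A_m(n))^2}{2 v_{m,i}}\Bigr)$.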
First note that if $\omega^2(1/\sqrt{n}, \mathcal{F}_i) \ge \frac{1}{\ln \gamma_{i,m}}\, \omega^2\bigl(\sqrt{\ln \gamma_{i,m}/n}, \mathcal{F}_i, \mathcal{F}_m\bigr)$, then $v_{i,m} \le 4A_i^2(n)$.
Since $e^{-2kx} \le \frac{1}{2} x^{-2k}$ for $x > 0$ and $k \ge 1$, it follows that



(62) 

On the other hand, if w 2 (^,^) < T ^—uj 2 { ] f^^,^ i ,^ m ) : then 
(4(2fe) 1 /2^ m + 4fc 1 /2A m (n))^ 



exp 



2"Wj m 



< exp 



(8/hi7 iim )w 2 (y / ln7 iim /n, Ti,T m ) 
(W > <exp(-(4Mn T „„ + 2 ^ ffi> )) 



< 



'w 2 (l/ v /n,^ r i,JP"„ 



w 4fc (l/^,^,^m) 2 ^(n) 
lAf (n) 



Combining (62) and (63) yields exp(- (^^^ m(n)? ) < ^ 
A similar argument yields exp(- ( 4 ( 2fc ) 1/2fc ^+ 4 ^ 1/2 ^(")) 2 ) < There- 
fore, 



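(64) $P(I_{i,m} = 0) \le \frac{A_i^{4k}(n)}{A_m^{4k}(n)}$ for $1 \le i < m \le k$.

Now note that for $j < m < i$, $\gamma_{j,m,+} \le \gamma_{j,i,+}$ and consequently
$\omega_+\bigl(\sqrt{\ln \gamma_{j,m,+}/n}, \mathcal{F}_j, \mathcal{F}_m\bigr) \le \omega_+\bigl(\sqrt{\ln \gamma_{j,i,+}/n}, \mathcal{F}_j, \mathcal{F}_i\bigr)$. It then follows that $A_i(n)$ are nondecreasing in
$i$ and from (64) that

(65) $P(I_m = 0) \le \sum_{l=m+1}^{k} P(I_{m,l} = 0) \le \sum_{l=m+1}^{k} \frac{A_m^{4k}(n)}{A_l^{4k}(n)} \le k \Bigl(\frac{A_m(n)}{A_{m+1}(n)}\Bigr)^{4k}$.

Set $I_j = \prod_{i>j} I_{j,i}$. Then

(66) $P(\hat T^* = \hat T_j) \le \min_{i \le m \le j-1} P(I_m = 0) \le \Bigl( \prod_{m=i}^{j-1} P(I_m = 0) \Bigr)^{1/(j-i)}$.

By combining (65) and (66), it follows $P(\hat T^* = \hat T_j) \le k\, A_i^4(n)/A_j^4(n)$. □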
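
The last step of this proof combines a minimum, a geometric mean and a telescoping product. As a sketch only, and assuming that (65) is read as $P(I_m = 0) \le k\,(A_m(n)/A_{m+1}(n))^{4k}$ with $A_m(n)$ nondecreasing in $m$, the computation can be written out as follows. The shorthand $a_m = A_m(n)$ is introduced here for brevity and is not part of the original text.

```latex
% Assumptions: a_m := A_m(n) > 0 is nondecreasing in m, and 1 <= i < j <= k.
\begin{align*}
P(\hat T^* = \hat T_j)
  &\le \min_{i \le m \le j-1} P(I_m = 0)
   \le \Bigl( \prod_{m=i}^{j-1} P(I_m = 0) \Bigr)^{1/(j-i)} \\
  &\le \Bigl( \prod_{m=i}^{j-1} k \Bigl( \frac{a_m}{a_{m+1}} \Bigr)^{4k} \Bigr)^{1/(j-i)}
   = k \Bigl( \frac{a_i}{a_j} \Bigr)^{4k/(j-i)}
   \le k \Bigl( \frac{a_i}{a_j} \Bigr)^{4}.
\end{align*}
% First step: using \hat T_j requires I_m = 0 for every m < j.
% Second step: a minimum is at most the geometric mean.
% Last step: a_i / a_j <= 1 and k / (j - i) >= 1.
% The elementary bound used earlier follows from x e^{-x} <= e^{-1}:
% e^{-2kx} x^{2k} = (x e^{-x})^{2k} <= e^{-2k} <= e^{-2} < 1/2 for k >= 1.
```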
Lemma 4. Suppose $f \in \mathcal{F}_i \setminus \mathcal{F}_{i-1}$ and $j < i$. If $|E\hat T_j - Tf| \ge \lambda(b_{j,i} + b_{i,j} +
A_i(n))$ for some $\lambda > 4(2k)^{1/2} + 2$, then

(07) P^sexp p-^f- 2 ' 2 ). 

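Proof. We shall only consider the case when $E\hat T_j - Tf \ge \lambda(b_{j,i} + b_{i,j} +
A_i(n))$ since the case of $E\hat T_j - Tf \le -\lambda(b_{j,i} + b_{i,j} + A_i(n))$ can be handled
similarly. Let $f \in \mathcal{F}_i \setminus \mathcal{F}_{i-1}$. Then $P(\hat T^* = \hat T_j) \le P(I_{j,i} = 1) \le P\bigl(\hat T_j - \hat T_{i,j} \le
(4(2k)^{1/2} + 1)\, b_{i,j} + 4k^{1/2} A_i(n)\bigr)$. Note that

$E\bigl(\hat T_j - \hat T_{i,j} - (4(2k)^{1/2} + 1)\, b_{i,j} - 4k^{1/2} A_i(n)\bigr)$

$\quad = E(\hat T_j - Tf) - E(\hat T_{i,j} - Tf) - (4(2k)^{1/2} + 1)\, b_{i,j} - 4k^{1/2} A_i(n)$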



> Xbij + XAi(n) - U^M^^. 
~{A{2k) 1 / 2 + l)b id -Ak 1 l 2 A i {n) 



> (A - 4(2A;) 1 / 2 -2)1 u U^M,^ ) + ^(n) J . 



Nate that Vax^-Ty)^^ 
hence 



/ (A-4(2fc)V 2 -2) 2 M^ln 7 ij/n,^,^-) + ^(n))^ 
P(T = * 6XP l 2 ^ j 

<exp(-^- 4 ( 2 f 2 - 2 ) 2 ). 
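We are now ready to prove Theorem 4.

Proof of Theorem 4. The minimax rate optimality of $\hat T^*$ over $\mathcal{F}_1$
follows from Lemma 3:

$\sup_{f \in \mathcal{F}_1} E(\hat T^* - Tf)^2 \le \sup_{f \in \mathcal{F}_1} E(\hat T_1 - Tf)^2$

$\qquad + \sum_{j=2}^{k} \sup_{f \in \mathcal{F}_1} (E|\hat T_j - Tf|^4)^{1/2} \cdot (P(\hat T^* = \hat T_j))^{1/2}$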
< kuj 2 ( -U.? 7 !



n 



and hence (59) follows. The proof of (60) is somewhat more involved. Consider
the case $f \in \mathcal{F}_i \setminus \mathcal{F}_{i-1}$ for some $i > 1$. Set

$J_1 = \{j < i : |E\hat T_j - Tf| \le (4(2k)^{1/2} + 2)(b_{j,i} + b_{i,j} + A_i(n))\}$,

$J_2 = \{j < i : |E\hat T_j - Tf| > (4(2k)^{1/2} + 2)(b_{j,i} + b_{i,j} + A_i(n))\}$.

Then for $j \in J_1$, $E(\hat T_j - Tf)^2 \le \omega^2(1/\sqrt{n}, \mathcal{F}_j) + (4(2k)^{1/2} + 2)^2 (b_{j,i} + b_{i,j} +
A_i(n))^2$. If $j \in J_2$, then $|E\hat T_j - Tf| = \lambda(b_{j,i} + b_{i,j} + A_i(n))$ for some $\lambda >
4(2k)^{1/2} + 2$. Hence, by (30),

$(E|\hat T_j - Tf|^4)^{1/2} \le 3 \operatorname{Var}(\hat T_j) + 3\lambda^2 (b_{j,i} + b_{i,j} + A_i(n))^2$

$\qquad \le 4\lambda^2 (b_{j,i} + b_{i,j} + A_i(n))^2$.

It then follows from Lemma 4 that (P(f * = l})) 1 / 2 < exp(- ( A ~ 4 ( 2 ^ 1/2 - 2 ) 2 ). 
Hence, for $f \in \mathcal{F}_i \setminus \mathcal{F}_{i-1}$ with $i > 1$,

£(T* - Tf) 2 = ^ - T/) 2 l(f * = fj)} 

i=l 

< Y Eft - Tf) 2 + Y (E\fj - Tf\ A ) l ' 2 {P{f* = f 3 )f/ 2 

k 

+ E(fi-Tf) 2 + Y (E\f j -Tf\ 4 )^ 2 (P(f* = f j )) 1 / 2 
j=i+i 

^ E W (~l=^i) + (4(2A:) 1 / 2 + 2) 2 {b hl + b itj + A(n)Y 
+ Y 4A 2 (6 jVt + 6 lJ+ ^(n)) 2 .exp^ (^ - 4(2/c) 1 / 2 - 2) 2 



(n) 



w(^) + £ +i 6^,4^| 

<CM 2 (n), 

where $C$ is a constant not depending on $f$. Note that in the last inequality
we use the fact that A 2 exp(— ( A ~ 4 ( 2fc ) 1 ~ 2 ) ) i s bounded as a function of A. 
Hence 

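$\sup_{f \in \mathcal{F}_i} E(\hat T^* - Tf)^2 = \max_{1 \le m \le i} \Bigl\{ \sup_{f \in \mathcal{F}_m \setminus \mathcal{F}_{m-1}} E(\hat T^* - Tf)^2 \Bigr\}$

$\qquad \le C \max_{1 \le m \le i} \{A_m^2(n)\} = C A_i^2(n)$

$\qquad = C \Bigl( \max_{1 \le m \le i-1} \Bigl\{ \omega_+^2\Bigl(\sqrt{\tfrac{\ln \gamma_{m,i,+}}{n}}, \mathcal{F}_m, \mathcal{F}_i\Bigr) \Bigr\} + \omega^2\Bigl(\tfrac{1}{\sqrt{n}}, \mathcal{F}_i\Bigr) \Bigr)$.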

4.2. Adaptation over nonnested parameter spaces. Many common pa- 
rameter spaces of interest such as Lipschitz spaces and Besov spaces are not 
nested. However, they often have nested structure in terms of the modulus 
of continuity. Theorem 4 can be generalized to such nonnested parameter 
spaces. 

Let $\mathcal{F}_i$, $i = 1, \ldots, k$, be closed convex parameter spaces which are not
necessarily nested. For any parameter set $\mathcal{F}$, let C.Hull($\mathcal{F}$) denote the convex
hull of $\mathcal{F}$. We shall write $a(\varepsilon) \asymp b(\varepsilon)$ when $a(\varepsilon)/b(\varepsilon)$ is bounded away
from 0 and $\infty$ as $\varepsilon \to 0+$. Suppose the parameter spaces $\mathcal{F}_i$ satisfy the
following conditions on the modulus of continuity:

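1. For $l \le i$ and $m \le j$, $\omega(\varepsilon, \mathcal{F}_l, \mathcal{F}_m) \le C \omega(\varepsilon, \mathcal{F}_i, \mathcal{F}_j)$ for some constant $C > 0$.

2. For $2 \le i \le k$, $\omega(\varepsilon, \mathcal{G}_i) \asymp \omega(\varepsilon, \text{C.Hull}(\mathcal{G}_i))$ where $\mathcal{G}_i = \bigcup_{m=1}^{i} \mathcal{F}_m$.

Note that conditions 1 and 2 are trivially satisfied if the $\mathcal{F}_i$ are nested.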

As shown in [6], the minimax linear rate of convergence for estimating a
linear functional $Tf$ over a parameter set $\mathcal{F}$ is determined by the modulus
over its convex hull, $\omega(1/\sqrt{n}, \text{C.Hull}(\mathcal{F}))$. Conditions 1 and 2 together yield
$\omega(\varepsilon, \mathcal{F}_i) \asymp \omega(\varepsilon, \text{C.Hull}(\mathcal{G}_i))$ and this consequently implies that for $1 \le i \le k$
there exists a rate optimal linear estimator $\hat T_i$ over $\mathcal{F}_i$ such that

(68) sup Eifi-TffKCu^i^T^). 



n 



Now define the quantities $\gamma_{i,j}$ and $\gamma_{i,j,+}$ as in (45), (46) and (47). Let $\hat T^*$
be defined the same as in Section 4.1 with the minimax rate optimal linear
estimator $\hat T_i$ over $\mathcal{F}_i$ satisfying (68). Under conditions 1 and 2 above, the
estimator $\hat T^*$ then achieves adaptation over the parameter spaces $\mathcal{F}_i$ with
minimum cost. More precisely, we have the following.

Theorem 5. Let $\mathcal{F}_i$, $i = 1, \ldots, k$, be closed convex parameter spaces
satisfying conditions 1 and 2 above and let the estimator $\hat T^*$ be given as above.
Then $\hat T^*$ is optimally adaptive over $\mathcal{F}_i$ for $i = 1, \ldots, k$, that is,

(69) $\sup_{f \in \mathcal{F}_1} E(\hat T^* - Tf)^2 \le C \omega^2\Bigl(\tfrac{1}{\sqrt{n}}, \mathcal{F}_1\Bigr)$

and for $2 \le i \le k$,

$\sup_{f \in \mathcal{F}_i} E(\hat T^* - Tf)^2$
(70) 

The proof of Theorem 5 is essentially the same as that of Theorem 4. We 
omit the proof for reasons of space. 

5. Adaptation over infinitely many parameter spaces. Section 4 gives 
a construction of adaptive estimators over collections of finitely many pa- 
rameter spaces. In this section we shall further extend these results to a con- 
tinuum of parameter spaces when the penalty of adaptation is always a log- 
arithmic factor of the noise level. Well-known examples of such cases include 
estimating a function at a point over a collection of Lipschitz classes or Besov 
classes. 

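Let $\{\mathcal{F}_\lambda : \lambda \in \Lambda\}$ be a collection of closed convex parameter spaces and $T$
be a linear functional. Suppose that the following conditions hold for some
constants $0 < c_1 \le c_2 < \infty$ and $\varepsilon_0 > 0$:

C1. The index set $\Lambda$ is an ordered set with $\min(\Lambda) = \lambda_* \in \Lambda$, $\max(\Lambda) = \lambda^* \in \Lambda$,
and $\mathcal{F}_{\lambda_1} \subset \mathcal{F}_{\lambda_2}$ if $\lambda_1 > \lambda_2$.
C2. For all $0 < \varepsilon \le \varepsilon_0$ and all $\lambda \in \Lambda$, $c_1 \varepsilon^{r_\lambda} \le \omega(\varepsilon, \mathcal{F}_\lambda) \le c_2 \varepsilon^{r_\lambda}$ where
$0 < r_{\lambda_2} < r_{\lambda_1} \le 1$ if $\lambda_2 < \lambda_1$.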
C3. For A2 < Ai and for all < e < eo, ^(£,^1) < w(e,F\ 2 ) an d to + (e,J 7 x 1 , 

C4. For any fixed $0 < \varepsilon \le \varepsilon_0$ the set $\{\omega(\varepsilon, \mathcal{F}_\lambda) : \lambda \in \Lambda\}$ is compact.

Under these conditions it is clear from Theorem 1 that the minimum cost
of adaptation is at least a logarithmic factor for any $\mathcal{F}_\lambda$ with $\lambda < \lambda^*$. We
shall develop an adaptive procedure over the whole collection $\{\mathcal{F}_\lambda : \lambda \in \Lambda\}$
which attains the exact minimax rate of convergence over $\mathcal{F}_{\lambda^*}$ and attains
the lower bounds given in Theorem 1 over any $\mathcal{F}_\lambda$ with $\lambda \in \Lambda$ and $\lambda < \lambda^*$.

The main idea behind the construction of the adaptive estimator is to
first put down a finite grid of parameter spaces such that the moduli of
continuity over any two spaces on the grid differ by at least a fixed constant
factor; moreover, the modulus over any space in the collection
$\{\mathcal{F}_\lambda : \lambda \in \Lambda\}$ is within a fixed constant factor of
the modulus over one of the parameter spaces on this grid. We
then use the techniques developed in Section 4.1 to construct a procedure
which is adaptive over the finite grid. This procedure is then
automatically adaptive over the whole collection. The
construction of the grid is based on the following simple lemma.
Lemma 5. Let $\Omega$ be a compact subset of the positive half line $\mathbb{R}_+$ such
that there exists an $\omega \in \Omega$ satisfying $\min(\Omega) < \omega \le \frac{1}{2}\max(\Omega)$. Then there
exists a unique finite sequence $\xi_1 < \xi_2 < \cdots < \xi_k$ with $\xi_i \in \Omega$, $\xi_1 = \min(\Omega)$
and $\xi_k = \max(\Omega)$ such that $\xi_{i+1} \ge 2\xi_i$ for all $2 \le i \le k-1$ and for any $\omega \in \Omega$
there exists $1 \le i \le k$ such that $\frac{1}{2}\xi_i \le \omega \le \xi_i$.
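
One way to picture such a grid is a greedy selection from the top: start at the largest element and repeatedly move to the largest element that is at most half of the current grid point, ending at the minimum. The Python sketch below does this for a finite set of positive numbers. This is an assumption made for illustration, not necessarily the construction used in the proof, and the function names `dyadic_grid` and `is_covered` are hypothetical.

```python
def dyadic_grid(omega):
    """Greedy top-down grid over a finite set of positive numbers.

    Returns xi_1 < ... < xi_k with xi_1 = min, xi_k = max, consecutive grid
    points above xi_1 at least a factor 2 apart, and every element w of the
    set satisfying xi_i / 2 <= w <= xi_i for some grid point xi_i.
    """
    pts = sorted(set(omega))
    lo, hi = pts[0], pts[-1]
    grid = [hi]
    while True:
        # Elements strictly above the minimum and at most half the current point.
        candidates = [w for w in pts if lo < w <= grid[-1] / 2]
        if not candidates:
            break
        grid.append(max(candidates))
    if lo < grid[-1]:
        grid.append(lo)  # the minimum is always the first grid point
    return grid[::-1]


def is_covered(w, grid):
    """Check whether some grid point xi satisfies xi / 2 <= w <= xi."""
    return any(xi / 2 <= w <= xi for xi in grid)
```

When no element lies strictly between the minimum and half the maximum, the function returns only the two end points, which matches the case $k_n = 2$ described in the text.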



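The grid is constructed as follows. Set $\Omega_n = \{\omega(\sqrt{\log n/n}, \mathcal{F}_\lambda) : \lambda \in \Lambda\}$. Then
for sufficiently large $n$ it follows from condition C4 that the set $\Omega_n$ is compact.
If for all $\omega \in \Omega_n$ with $\omega > \min(\Omega_n)$, $\omega > \frac{1}{2}\max(\Omega_n)$, then set $k_n = 2$,
$\xi_1 = \min(\Omega_n)$ and $\xi_2 = \max(\Omega_n)$. Otherwise there is a sequence $\xi_1 < \xi_2 <
\cdots < \xi_{k_n}$ in $\Omega_n$ satisfying the conditions given in Lemma 5. Let $\mathcal{F}_{\lambda_1} \subset \mathcal{F}_{\lambda_2} \subset
\cdots \subset \mathcal{F}_{\lambda_{k_n}}$ be the corresponding closed convex parameter spaces with $\lambda_i \in \Lambda$
and $\xi_i = \omega(\sqrt{\log n/n}, \mathcal{F}_{\lambda_i})$. Note that it follows from the conditions that $\lambda_1 = \lambda^* =
\max(\Lambda)$, $\lambda_{k_n} = \lambda_* = \min(\Lambda)$ and $k_n \le \log_2 n$ for large enough $n$. For convenience
write $\mathcal{F}_i$ for $\mathcal{F}_{\lambda_i}$. This sequence of parameter spaces $\{\mathcal{F}_i : 1 \le i \le k_n\}$
forms a grid over the whole collection of parameter spaces $\{\mathcal{F}_\lambda : \lambda \in \Lambda\}$ such
that for any $\lambda \in \Lambda$ with $\lambda < \lambda^*$, there exists $2 \le i \le k_n$ satisfying $\mathcal{F}_\lambda \subset \mathcal{F}_i$
and

(71) $\frac{1}{2}\, \omega\Bigl(\sqrt{\tfrac{\log n}{n}}, \mathcal{F}_i\Bigr) \le \omega\Bigl(\sqrt{\tfrac{\log n}{n}}, \mathcal{F}_\lambda\Bigr) \le \omega\Bigl(\sqrt{\tfrac{\log n}{n}}, \mathcal{F}_i\Bigr)$.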

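We shall now turn to the construction of the adaptive estimator based
on this grid. The construction is different but similar to the one given in
Section 4.1. Let $\hat T_1$ be a linear estimator satisfying $\sup_{f \in \mathcal{F}_1} E(\hat T_1 - Tf)^2 \le
\omega^2(1/\sqrt{n}, \mathcal{F}_1)$. For $1 \le i, j \le k_n$ with $\max(i,j) \ge 2$ let $\hat T_{i,j}$ be the estimator
satisfying (12)-(14) with $\mathcal{F} = \mathcal{F}_i$, $\mathcal{H} = \mathcal{F}_j$ and $V = \frac{1}{\log n}\, \omega^2\bigl(\sqrt{\log n/n}, \mathcal{F}_i, \mathcal{F}_j\bigr)$.
For $i < j$ the test between $\mathcal{F}_i$ and $\mathcal{F}_j$ is given by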

(72) $I_{i,j} = 1\Bigl(\hat T_{i,j} - \tfrac{11}{2}\, \omega\Bigl(\sqrt{\tfrac{\log n}{n}}, \mathcal{F}_j\Bigr) \le \hat T_i \le \hat T_{j,i} + \tfrac{11}{2}\, \omega\Bigl(\sqrt{\tfrac{\log n}{n}}, \mathcal{F}_j\Bigr)\Bigr)$.

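The test is in favor of $\mathcal{F}_i$ if $I_{i,j} = 1$. Our adaptive procedure is described
sequentially in exactly the same way as given in Section 4.1. Formally the
estimator $\hat T^*$ can be written as

(73) $\hat T^* = \sum_{i=1}^{k_n} \Bigl( \prod_{m<i} \Bigl(1 - \prod_{j>m} I_{m,j}\Bigr) \Bigr) \Bigl( \prod_{j>i} I_{i,j} \Bigr) \hat T_i$.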

The following theorem shows that this estimator is optimally adaptive 
over the whole collection of parameter spaces { T\ : A G A} . 
Theorem 6. Let $\{\mathcal{F}_\lambda : \lambda \in \Lambda\}$ be a collection of nested closed convex
parameter spaces and $T$ be a linear functional. Suppose that conditions C1-C4
hold. Then the estimator $\hat T^*$ defined in (73) is optimally adaptive over
$\mathcal{F}_\lambda$ for all $\lambda \in \Lambda$. More specifically, there exists a constant $C > 0$ such that



Remark. The structural conditions C1-C4 are used to keep track of a 
growing number of between-class moduli. These conditions seem to be nec- 
essary for developing an adaptation theory over infinitely many parameter 
spaces. The completely general setting is difficult because it is possible that 
the penalty for adaptation varies from space to space in a very complicated 
way from no penalty to a logarithmic factor to an algebraic factor. 

The proof of Theorem 6 is similar to that of Theorem 4. It relies on the
analysis of the tests $I_{i,j}$. For $f \in \mathcal{F}_i \setminus \mathcal{F}_{i-1}$, the main analysis is concerned
with the cases where $\hat T_j$ with $j < i$ is used and where $\hat T_j$ with $j > i$ is used.
Lemma 6 shows that the chance of using $\hat T_j$ with $j > i$ is small. Lemma 7
shows that the chance of using $\hat T_j$ with $j < i$ is small whenever the bias of $\hat T_j$
is large. Before presenting these technical results we first collect some useful
bounds on the expectations and variances of $\hat T_i - \hat T_{i,j}$ and $\hat T_i - \hat T_{j,i}$.

For $j \ge 2$ set



(74) 




and for all $\lambda \in \Lambda$ and $\lambda < \lambda^*$



(75) 




(76) 




and write $\hat T_j$ for $\hat T_{j,j}$. Note that if $f \in \mathcal{F}_i$ then for $j > i$



(77) 




(78) 




For the variances, note that if $f \in \mathcal{F}_1$ and $j \ge 2$
(79)



(80) 



-^..Fi) + — Lu; 
n ) log n 

Var(Ii -T j l )<2\u) l [^=,F 1 \ + -^-w 

m / logn 

' logn 



77 



'logn 



'logn 



n 



'logn 



n 



and if $f \in \mathcal{F}_i$ with $2 \le i \le j$,



Vax(Ti - Tij) < 2 



1 



(81) 



logn 



< u 



'logn 



+ u 

I logn 



'logn 



n 



(82) 



Var(f i -f, i )<2U-^(J^ ; 

\ logn \ y n 



< Vj 



Ti + : W 

/ logn 



'logn 



•F7 ) F" i 



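Lemma 6. If $f \in \mathcal{F}_1$, then

(83) $P(\hat T^* = \hat T_2) \le 4 \exp\Bigl(-2\, \frac{\omega^2(\sqrt{\log n/n}, \mathcal{F}_2)}{\omega^2(1/\sqrt{n}, \mathcal{F}_1)}\Bigr) + 2k_n n^{-2}$,

and for $j \ge 3$,

(84) $P(\hat T^* = \hat T_j) \le \min_{1 \le m \le j-1} P(I_m = 0) \le 2k_n n^{-2}$.

If $f \in \mathcal{F}_i$ with $i \ge 2$, then for $j > i$,

(85) $P(\hat T^* = \hat T_j) \le 2k_n n^{-2}$.

In particular, (85) holds for $f \in \mathcal{F}_i \setminus \mathcal{F}_{i-1}$.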

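Proof. First note that (84) follows from (85) so we need only prove
(83) and (85). We first prove (85). It follows from (77), (78), (81) and (82)
that for $f \in \mathcal{F}_i$ and $2 \le i < j \le k_n$,

$P(I_{i,j} = 0) \le P\Bigl(\hat T_i - \hat T_{i,j} < -\tfrac{11}{2}\, \omega\Bigl(\sqrt{\tfrac{\log n}{n}}, \mathcal{F}_j\Bigr)\Bigr) + P\Bigl(\hat T_i - \hat T_{j,i} > \tfrac{11}{2}\, \omega\Bigl(\sqrt{\tfrac{\log n}{n}}, \mathcal{F}_j\Bigr)\Bigr)$

$\qquad \le \exp\Bigl(-\frac{((11/2)\, \omega(\sqrt{\log n/n}, \mathcal{F}_j) - b_j)^2}{2 v_j}\Bigr) + \exp\Bigl(-\frac{((11/2)\, \omega(\sqrt{\log n/n}, \mathcal{F}_j) - b_j)^2}{2 v_j}\Bigr)$

$\qquad \le 2 \exp\Bigl(-\frac{(11/2 - 3/2)^2\, \omega^2(\sqrt{\log n/n}, \mathcal{F}_j)}{(8/\log n)\, \omega^2(\sqrt{\log n/n}, \mathcal{F}_j)}\Bigr) = 2n^{-2}$.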

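Set $I_m = \prod_{l>m} I_{m,l}$. Then $P(I_m = 0) \le \sum_{l=m+1}^{k_n} P(I_{m,l} = 0) \le 2k_n n^{-2}$ and
hence

(86) $P(\hat T^* = \hat T_j) \le \min_{i \le m \le j-1} P(I_m = 0) \le 2k_n n^{-2}$.

This proves (85). Now assume that $f \in \mathcal{F}_1$. Then for $j \ge 2$,

$P(I_{1,j} = 0) \le 2 \exp\Bigl(-\frac{(11/2 - 3/2)^2\, \omega^2(\sqrt{\log n/n}, \mathcal{F}_j)}{4\omega^2(1/\sqrt{n}, \mathcal{F}_1) + (4/\log n)\, \omega^2(\sqrt{\log n/n}, \mathcal{F}_j)}\Bigr)$.

We consider two cases. First if $\omega^2(1/\sqrt{n}, \mathcal{F}_1) \le \frac{1}{\log n}\, \omega^2(\sqrt{\log n/n}, \mathcal{F}_2)$, then

$P(I_{1,j} = 0) \le 2 \exp\Bigl(-\frac{(11/2 - 3/2)^2\, \omega^2(\sqrt{\log n/n}, \mathcal{F}_j)}{(8/\log n)\, \omega^2(\sqrt{\log n/n}, \mathcal{F}_j)}\Bigr) = 2n^{-2}$

and hence $P(\hat T^* = \hat T_2) \le 2k_n n^{-2}$ and (83) follows.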



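Now suppose that $\omega^2(1/\sqrt{n}, \mathcal{F}_1) > \frac{1}{\log n}\, \omega^2(\sqrt{\log n/n}, \mathcal{F}_2)$. Let $j^*$ be the
largest integer such that $\omega^2(1/\sqrt{n}, \mathcal{F}_1) > \frac{1}{\log n}\, \omega^2(\sqrt{\log n/n}, \mathcal{F}_{j^*})$. Then for $j^* + 1 \le
j \le k_n$ it is easy to see that $P(I_{1,j} = 0) \le 2n^{-2}$. For $2 \le j \le j^*$,

$P(I_{1,j} = 0) \le 2 \exp\Bigl(-\frac{(11/2 - 3/2)^2\, \omega^2(\sqrt{\log n/n}, \mathcal{F}_j)}{4\omega^2(1/\sqrt{n}, \mathcal{F}_1) + (4/\log n)\, \omega^2(\sqrt{\log n/n}, \mathcal{F}_j)}\Bigr)$

$\qquad \le 2 \exp\Bigl(-2\, \frac{\omega^2(\sqrt{\log n/n}, \mathcal{F}_j)}{\omega^2(1/\sqrt{n}, \mathcal{F}_1)}\Bigr)$.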
Note that by the construction of the grid of parameter spaces $\{\mathcal{F}_i : 1 \le i \le
k_n\}$, for $j \ge 2$,



2 i 



Hence, 



Pit* = T 2 ) < P(h = 0) < p (h tj = 0) 

<2Eex P f-2^S^) + 2 £ 



n 2 



J=2 



u; 2 (l/^,^i) 
and once again (83) follows. □ 

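Lemma 7. Suppose $f \in \mathcal{F}_i \setminus \mathcal{F}_{i-1}$ and $j < i$. If $|E\hat T_j - Tf| \ge \beta\, \omega(\sqrt{\log n/n}, \mathcal{F}_i)$
for some $\beta > 6$, then

(87) $P(\hat T^* = \hat T_j) \le n^{-(\beta - 6)^2/8}$.

Proof. We shall only consider the case when $E\hat T_j - Tf \ge \beta\, \omega(\sqrt{\log n/n}, \mathcal{F}_i)$
since the case of $E\hat T_j - Tf \le -\beta\, \omega(\sqrt{\log n/n}, \mathcal{F}_i)$ can be handled similarly. Let
$f \in \mathcal{F}_i \setminus \mathcal{F}_{i-1}$. Then $P(\hat T^* = \hat T_j) \le P(I_{j,i} = 1) \le P\bigl(\hat T_j - \hat T_{i,j} - \tfrac{11}{2}\, \omega(\sqrt{\log n/n},
\mathcal{F}_i) \le 0\bigr)$. Note that

$E(\hat T_j - Tf) - E(\hat T_{i,j} - Tf) - \tfrac{11}{2}\, \omega\Bigl(\sqrt{\tfrac{\log n}{n}}, \mathcal{F}_i\Bigr) \ge (\beta - 6)\, \omega\Bigl(\sqrt{\tfrac{\log n}{n}}, \mathcal{F}_i\Bigr)$.

Note that $\operatorname{Var}(\hat T_j - \hat T_{i,j}) \le \frac{4}{\log n}\, \omega^2(\sqrt{\log n/n}, \mathcal{F}_i)$ and hence

$P(\hat T^* = \hat T_j) \le \exp\Bigl(-\frac{1}{2}\, \frac{(\beta - 6)^2\, \omega^2(\sqrt{\log n/n}, \mathcal{F}_i)}{(4/\log n)\, \omega^2(\sqrt{\log n/n}, \mathcal{F}_i)}\Bigr)$

$\qquad \le n^{-(\beta - 6)^2/8}$. □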
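
The exponents in Lemmas 6 and 7 come from the same Gaussian tail computation. The short derivation below spells out the arithmetic. It is a sketch under two assumptions: the difference of estimators is Gaussian, as for linear estimators in the white noise model, and the bias of $\hat T_{i,j}$ is at most $\frac{1}{2}\omega$ in absolute value. The shorthand symbols $X$, $\omega$ and $\omega_j$ are introduced here and are not in the original text.

```latex
% Shorthand (introduced here): \omega = \omega(\sqrt{\log n / n}, \mathcal{F}_i),
% X = \hat T_j - \hat T_{i,j} - (11/2)\,\omega, assumed Gaussian.
% Mean: EX \ge \beta\omega - \tfrac12\omega - \tfrac{11}{2}\omega = (\beta - 6)\,\omega > 0.
% Variance: Var(X) \le (4/\log n)\,\omega^2.
\begin{align*}
P(X \le 0)
  &\le \exp\Bigl( -\frac{(EX)^2}{2 \operatorname{Var}(X)} \Bigr)
   \le \exp\Bigl( -\frac{(\beta - 6)^2\, \omega^2}{2 \cdot (4/\log n)\, \omega^2} \Bigr) \\
  &= \exp\Bigl( -\frac{(\beta - 6)^2}{8} \log n \Bigr)
   = n^{-(\beta - 6)^2/8}.
\end{align*}
% The same arithmetic with \omega_j = \omega(\sqrt{\log n / n}, \mathcal{F}_j) gives the
% rate in Lemma 6:
\[
\exp\Bigl( -\frac{(11/2 - 3/2)^2\, \omega_j^2}{(8/\log n)\, \omega_j^2} \Bigr)
  = \exp(-2 \log n) = n^{-2}.
\]
```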

We are now ready to prove Theorem 6 using the above technical results.
We first show that the estimator $\hat T^*$ given in (73) has the desired adaptation
properties over the grid $\{\mathcal{F}_i : 1 \le i \le k_n\}$ and then show that $\hat T^*$ is in fact
adaptive over the whole collection of parameter spaces $\{\mathcal{F}_\lambda : \lambda \in \Lambda\}$.

Proof of Theorem 6. The proof is broken into three steps. In each
step it is important to note that $k_n \le \log_2 n$ and $\omega(\sqrt{\log n/n}, \mathcal{F}_j) \le \frac{1}{2}\, \omega(\sqrt{\log n/n}, \mathcal{F}_{j+1})$.

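Step 1. We begin by showing that the estimator $\hat T^*$ attains the exact
minimax rate over $\mathcal{F}_1 = \mathcal{F}_{\lambda^*}$. We shall only consider the case $\omega^2(1/\sqrt{n}, \mathcal{F}_1) >
\frac{1}{\log n}\, \omega^2(\sqrt{\log n/n}, \mathcal{F}_2)$. When $\omega^2(1/\sqrt{n}, \mathcal{F}_1) \le \frac{1}{\log n}\, \omega^2(\sqrt{\log n/n}, \mathcal{F}_2)$ the proof is easier.
Note that

$\sup_{f \in \mathcal{F}_1} E(\hat T^* - Tf)^2 \le \sup_{f \in \mathcal{F}_1} E(\hat T_1 - Tf)^2$

$\qquad + \sup_{f \in \mathcal{F}_1} (E|\hat T_2 - Tf|^4)^{1/2} \cdot (P(\hat T^* = \hat T_2))^{1/2}$
$\qquad + \sum_{j=3}^{k_n} \sup_{f \in \mathcal{F}_1} (E|\hat T_j - Tf|^4)^{1/2} \cdot (P(\hat T^* = \hat T_j))^{1/2}$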



+ Cu; 2 ( W : ^,^"2 I -2exp 



logn \ / ^(y/logn/n,^) 



n ' / V w 2 (l/^/n,^ r i) 



i=2 



Now note that 



2/ / lo g ra ^ \ / w (^logn/ra, ^2) 
uj \ ,T 2 exp' 



2 / /logn \ w 2 (Vlogn/n,^2) / uJ 2 (y/\ognJn,J r 2 )\ 

^ V~^T - 2 d/v^^i) exp r "'(w^o J 
s 2 l °g n -r- \ -x^ 1 2 / lo g n -r



and 



fy (ffV,) • (2k n )^n-i < 2a, 2 ■ (2^)^ 

= o(n _1 ). 

Hence $\sup_{f \in \mathcal{F}_1} E(\hat T^* - Tf)^2 \le C \omega^2(1/\sqrt{n}, \mathcal{F}_1)$ for some absolute constant $C > 0$.

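Step 2. Now consider $i \ge 2$. Let $f \in \mathcal{F}_i \setminus \mathcal{F}_{i-1}$ for some $i \ge 2$. Set

$J_1 = \bigl\{j < i : |E\hat T_j - Tf| \le 7\, \omega\bigl(\sqrt{\log n/n}, \mathcal{F}_i\bigr)\bigr\}$,

$J_2 = \bigl\{j < i : |E\hat T_j - Tf| > 7\, \omega\bigl(\sqrt{\log n/n}, \mathcal{F}_i\bigr)\bigr\}$.

Then for $f \in \mathcal{F}_i \setminus \mathcal{F}_{i-1}$ with $i > 1$ we have

$E(\hat T^* - Tf)^2 = \sum_{j=1}^{k_n} E\{(\hat T_j - Tf)^2\, 1(\hat T^* = \hat T_j)\}$

$\qquad \le \sum_{j \in J_1} E\{(\hat T_j - Tf)^2\, 1(\hat T^* = \hat T_j)\}$

$\qquad + \sum_{j \in J_2} (E|\hat T_j - Tf|^4)^{1/2} (P(\hat T^* = \hat T_j))^{1/2}$

$\qquad + E(\hat T_i - Tf)^2 + \sum_{j=i+1}^{k_n} (E|\hat T_j - Tf|^4)^{1/2} (P(\hat T^* = \hat T_j))^{1/2}$

$\qquad = S_1 + S_2 + S_3 + S_4$.

We bound the four terms separately. First consider $S_1$. Note that

$S_1 = \sum_{j \in J_1} E\{(\hat T_j - Tf)^2\, 1(\hat T^* = \hat T_j)\}$

$\qquad \le 2 \sum_{j \in J_1} E\{[(\hat T_j - E\hat T_j)^2 + (E\hat T_j - Tf)^2]\, 1(\hat T^* = \hat T_j)\}$

$\qquad \le 2 \sum_{j \in J_1} \operatorname{Var}(\hat T_j) + 2 \sum_{j \in J_1} (E\hat T_j - Tf)^2 P(\hat T^* = \hat T_j)$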



32 T. T. CAI AND M. G. LOW 

y 

r— ' loera 



'logn 



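Now consider $S_2$. If $j \in J_2$, then $|E\hat T_j - Tf| = \beta_j\, \omega(\sqrt{\log n/n}, \mathcal{F}_i)$ for some
$\beta_j > 7$. Hence by (30),

$(E|\hat T_j - Tf|^4)^{1/2} \le 3 \operatorname{Var}(\hat T_j) + 3\beta_j^2\, \omega^2\Bigl(\sqrt{\tfrac{\log n}{n}}, \mathcal{F}_i\Bigr)$.

It follows from Lemma 7 that $(P(\hat T^* = \hat T_j))^{1/2} \le n^{-(\beta_j - 6)^2/16}$. Hence

$S_2 = \sum_{j \in J_2} (E|\hat T_j - Tf|^4)^{1/2} (P(\hat T^* = \hat T_j))^{1/2}$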



n 



<^ll^jr\ 4fcnn -l/16 x 2 n -((,-6)2-l)/16 



'logn 
n 

For $S_3$ it is clear from the construction of $\hat T_i = \hat T_{i,i}$ that



S3 = ^f J -r/) 2 <2a; 2 (^,J 



Finally for $S_4$ it follows from (30) and Lemma 6 that

$S_4 = \sum_{j=i+1}^{k_n} (E|\hat T_j - Tf|^4)^{1/2} (P(\hat T^* = \hat T_j))^{1/2}$
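Putting the four terms together we have that, for $f \in \mathcal{F}_i \setminus \mathcal{F}_{i-1}$ with $i \ge 2$,

$E(\hat T^* - Tf)^2 \le S_1 + S_2 + S_3 + S_4 \le C \omega^2\Bigl(\sqrt{\tfrac{\log n}{n}}, \mathcal{F}_i\Bigr)$

where $C$ is an absolute constant not depending on $f$, $n$, $k_n$ and $i$. Hence for
all $2 \le i \le k_n$,

$\sup_{f \in \mathcal{F}_i} E(\hat T^* - Tf)^2 = \max_{1 \le m \le i} \Bigl\{ \sup_{f \in \mathcal{F}_m \setminus \mathcal{F}_{m-1}} E(\hat T^* - Tf)^2 \Bigr\} \le C \omega^2\Bigl(\sqrt{\tfrac{\log n}{n}}, \mathcal{F}_i\Bigr)$.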



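Step 3. Steps 1 and 2 show that the estimator $\hat T^*$ is adaptive over the
grid $\{\mathcal{F}_i : 1 \le i \le k_n\}$. It is now easy to show that $\hat T^*$ is in fact adaptive
over the collection $\{\mathcal{F}_\lambda : \lambda \in \Lambda\}$. Note that for any $\lambda \in \Lambda$ with $\lambda < \lambda^*$, by the
construction of the grid $\{\mathcal{F}_i : 1 \le i \le k_n\}$, there exists $2 \le i \le k_n$ such that
$\mathcal{F}_\lambda \subset \mathcal{F}_i$ with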



Hence 

$\sup_{f \in \mathcal{F}_\lambda} E(\hat T^* - Tf)^2 \le \sup_{f \in \mathcal{F}_i} E(\hat T^* - Tf)^2$



log n 



and the theorem is proved. □ 



Remark. Similarly to the case of finitely many parameter spaces, the 
results given above for infinitely many nested spaces can be extended in 
a straightforward way to nonnested parameter spaces when the moduli of 
continuity have nested structure under conditions similar to those given in 
Section 4.2. 